Incident Date | 2021-02-08 |
---|---|
Reference Numbers | CLOUD:74936 |
Incident Start Date (UTC) | 2021-02-08T18:51 |
Incident End Date (UTC) | 2021-02-08T20:40 |
Incident Duration | 1:48:32 |
Impact Start Date (UTC) | 2021-02-08T18:31 |
Impact End Date (UTC) | 2021-02-08T19:17 |
Impact Duration | 46 minutes |
Customer deployments in the GCP us-central1 (Iowa) region experienced a complete loss of network connectivity through the proxy layer (the “proxies”) for 46 minutes.
The impact duration was longer than anticipated for the following reasons:
Time-to-detect was 6 minutes:
Time-to-respond was 14 minutes because indirect alerts (see above) led to a misunderstanding of the scope of customer impact.
Time-to-remediate was 26 minutes:
Reassociating the proxy Instance Groups with the Target Pool remediated the problem. It then took three (3) minutes before enough proxies were reintroduced to handle the traffic.
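
The breakdown above can be cross-checked against the timeline with a short calculation. The following is a minimal sketch, not part of the original incident tooling. The timestamps are copied from this report. The mapping of each phase to a pair of timeline events (detect = impact start to SLA alert, respond = SLA alert to incident creation, remediate = incident creation to impact end) is an assumption inferred from the stated numbers, and the variable and function names are illustrative only.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline in this report.
impact_start = datetime.fromisoformat("2021-02-08T18:31")
alert_triggered = datetime.fromisoformat("2021-02-08T18:37")
incident_created = datetime.fromisoformat("2021-02-08T18:51")
pool_reassociated = datetime.fromisoformat("2021-02-08T19:14")
impact_end = datetime.fromisoformat("2021-02-08T19:17")


def minutes_between(start, end):
    """Return the whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


# Assumed phase boundaries (inferred, not stated explicitly in the report).
time_to_detect = minutes_between(impact_start, alert_triggered)
time_to_respond = minutes_between(alert_triggered, incident_created)
time_to_remediate = minutes_between(incident_created, impact_end)

# Time from the successful fix until enough proxies were back in the pool.
recovery_after_fix = minutes_between(pool_reassociated, impact_end)

impact_duration = minutes_between(impact_start, impact_end)

print(time_to_detect, time_to_respond, time_to_remediate)
print(recovery_after_fix, impact_duration)
```

Under these assumptions the three phases add up to the full impact duration, which suggests the reported figures are internally consistent.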
UTC Date & Time | Event |
---|---|
2021-02-08T18:24 UTC | Change made by operator using the GCP User Console to drain six (6) proxies from Target Pool in us-central1 region |
2021-02-08T18:31 UTC | Impact starts according to SLA dashboards |
2021-02-08T18:34 UTC | GCP logs show completed removal of all Instance Groups from Target Pool |
2021-02-08T18:37 UTC | SLA alert triggered regarding loss of connectivity in GCP us-central1 region |
2021-02-08T18:51 UTC | Incident created for loss of connectivity to GCP us-central1 region |
2021-02-08T18:58 UTC | Investigation by Elastic engineers discovers proxy Instance Groups were disassociated from Target Pool, resulting in no available hosts to handle traffic |
2021-02-08T19:03 UTC | Elastic engineers begin adding new proxies to all zones to resolve the problem (this solution does not work) |
2021-02-08T19:08 UTC | Incident posted on Elastic Cloud Status |
2021-02-08T19:14 UTC | Elastic engineers reassociate the Instance Groups with the Target Pool (this solution works) |
2021-02-08T19:17 UTC | Enough proxy instances reintroduced to Target Pool to handle expected traffic. |
2021-02-08T21:29 UTC | After replicating the behavior in a non-production environment, Elastic engineers opened a ticket with the cloud provider regarding the change that was made and the unexpected result. |
2021-02-08T22:20 UTC | Incident marked as resolved on Elastic Cloud Status |
2021-02-10T23:52 UTC | Cloud provider confirmed that the observed behavior, in which Instance Groups are disassociated from the load balancer Target Pool when an operator attempts to remove individual instances, was a bug in the GCP User Console. |