Complete unavailability of gcp-us-central1
Incident Report for ESS (Public)
Postmortem
Incident Date 2021-02-08
Reference Numbers CLOUD:74936
Incident Start Date (UTC) 2021-02-08T18:51
Incident End Date (UTC) 2021-02-08T20:40
Incident Duration 1:48:32
Impact Start Date (UTC) 2021-02-08T18:31
Impact End Date (UTC) 2021-02-08T19:17
Impact Duration 46 minutes

Summary

Impact

A complete loss of network connectivity through the “proxy layer” (the “proxies”) to customer deployments in the GCP us-central1 (Iowa) region for 46 minutes.

Root Cause(s)

  1. During a manual scale-down of six proxies in GCP us-central1 (two proxies in each of three configured availability zones), a bug in the GCP User Console disassociated all proxy Instance Groups from the Target Pool instead of removing the desired proxies from the Instance Groups. This was the root cause of impact.
  2. The impact duration was longer than anticipated for the following reasons:

    1. Time-to-detect was 6 minutes:

      1. SLA monitoring triggered an alert, but it did not page the on-call team for awareness.
      2. Alerts for our monitoring clusters being unavailable indirectly brought attention to the incident.
    2. Time-to-respond was 14 minutes because the indirect alerts (see above) led to a misunderstanding of the scope of customer impact.

    3. Time-to-remediate was 26 minutes:

      1. Six (6) minutes: Our logging clusters were not reachable since they depend on the proxy layer that was impacted, making it more difficult to triage the problem.
      2. 17 minutes: Initial attempts to add more proxies failed since the Instance Groups were not connected to the Target Pool. Additional discovery was needed after that effort failed to reconcile the problem.
      3. Three (3) minutes: Reassociating the proxy Instance Groups to the Target Pool took three (3) minutes to complete. This solved the problem.

Resolution Action Items

Reassociating the proxy Instance Groups to the Target Pool remediated the problem. It took three (3) minutes before enough proxies were reintroduced to successfully manage the traffic.

Timeline

UTC Date & Time Event
2021-02-08T18:24 UTC Change made by operator using the GCP User Console to drain six (6) proxies from Target Pool in us-central1 region
2021-02-08T18:31 UTC Impact starts according to SLA dashboards
2021-02-08T18:34 UTC GCP logs show completed removal of all Instance Groups from Target Pool
2021-02-08T18:37 UTC SLA alert triggered regarding loss of connectivity in GCP us-central1 region
2021-02-08T18:51 UTC Incident created for loss of connectivity to GCP us-central1 region
2021-02-08T18:58 UTC Investigation by Elastic engineers discovers proxy Instance Groups were disassociated from Target Pool, resulting in no available hosts to handle traffic
2021-02-08T19:03 UTC Elastic engineers begin adding new proxies to all zones to resolve problem (this solution does not work)
2021-02-08T19:08 UTC Incident posted on Elastic Cloud Status
2021-02-08T19:14 UTC Elastic engineers reassociate the Instance Groups with the Target Pool (this solution works)
2021-02-08T19:17 UTC Enough proxy instances reintroduced to Target Pool to handle expected traffic.
2021-02-08T21:29 UTC After replicating the behavior in a non-production environment, Elastic engineers opened a ticket with the cloud provider regarding the change that was made and the unexpected result.
2021-02-08T22:20 UTC Incident marked as resolved on Elastic Cloud Status
2021-02-10T23:52 UTC Cloud provider confirmed that the observed behavior of disassociating Instance Groups from the load balancer Target Pool when attempting to remove individual instances was a bug in the GCP User Console.

Action Items

Completed

  • Reduce time-to-respond: Ensure SLA alerts page on-call engineers when triggered.

Work in Progress

  • Mitigate failure: A support ticket was filed with GCP regarding the User Console Bug. It has been acknowledged and reproduced by them. There is currently no ETA for when it will be resolved.

Future Work

  • Mitigate failure: Implement process, tooling, and documentation for ad-hoc scale-down of the proxy layer that avoids the cloud provider bug in the GCP User Console
  • Mitigate impact: Improve the proxy layer autoscaling to reduce the need for ad-hoc scaling
  • Reduce time-to-detect: Improve alerting during total loss of proxy connectivity in a region
  • Reduce time-to-remediate: Improve availability of monitoring clusters during incidents involving proxy layer
  • Improve communication: Improve automation of Elastic Cloud Status updates during incidents to reduce friction and time for incident updates
  • Improve communication: Update incident management process to communicate via Elastic Cloud Status when an RCA is expected to be posted for qualifying incidents
Posted Feb 17, 2021 - 22:11 UTC
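
An added example, not Elastic's tooling: a post-change guard of the kind the "Mitigate failure" item above describes, comparing Target Pool membership before and after a scale-down against the intended removals. The project name, proxy names and snapshot lists are constructed; the URL shape follows the `instances` field of a Compute Engine target pool resource.

```python
# Constructed sample: instance URLs shaped like the `instances` field of a
# Compute Engine target pool resource. Project and proxy names are made up.
BASE = "https://www.googleapis.com/compute/v1/projects/example-project/zones"
ZONES = ("us-central1-a", "us-central1-b", "us-central1-c")


def url(zone, n):
    return f"{BASE}/{zone}/instances/proxy-{zone[-1]}-{n}"


def parse_instance(u):
    """Split an instance URL into (zone, name)."""
    parts = u.rstrip("/").split("/")
    if len(parts) < 4 or parts[-4] != "zones" or parts[-2] != "instances":
        raise ValueError(f"not an instance URL: {u!r}")
    return parts[-3], parts[-1]


def check_scale_down(before, after, intended, min_per_zone=1):
    """Return findings for a pool change; an empty list means it matched intent."""
    before_set = {parse_instance(u) for u in before}
    after_set = {parse_instance(u) for u in after}
    findings = []
    if before_set and not after_set:
        findings.append("CRITICAL: target pool has no backends left")
    unexpected = (before_set - after_set) - set(intended)
    if unexpected:
        findings.append(f"removed but not requested: {sorted(unexpected)}")
    lingering = set(intended) & after_set
    if lingering:
        findings.append(f"requested but still present: {sorted(lingering)}")
    for zone in sorted({z for z, _ in before_set}):
        left = sum(1 for z, _ in after_set if z == zone)
        if left < min_per_zone:
            findings.append(f"zone {zone}: {left} backend(s), need {min_per_zone}")
    return findings


# Four proxies per zone before; two per zone are meant to be removed.
BEFORE = [url(z, n) for z in ZONES for n in range(4)]
INTENDED = {(z, f"proxy-{z[-1]}-{n}") for z in ZONES for n in (2, 3)}
AFTER_OK = [u for u in BEFORE if parse_instance(u) not in INTENDED]
AFTER_EMPTY = []  # the failure mode in this report: nothing left in the pool

if __name__ == "__main__":
    for label, after in (("expected", AFTER_OK), ("empty pool", AFTER_EMPTY)):
        print(label, check_scale_down(BEFORE, after, INTENDED) or "OK")
```

Run immediately after the change, a non-empty result is a signal to stop and restore pool membership before attempting further scaling.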

Resolved
This incident has been resolved.
Posted Feb 08, 2021 - 22:20 UTC
Update
We have discovered and fixed the issue with GCP Loadbalancer configuration that led to all proxy instances being dissociated from the Loadbalancer, and a complete outage of the region. As of 02/08 2:16 ET, the issue has been resolved and the region should be accessible again.
Posted Feb 08, 2021 - 19:19 UTC
Investigating
Starting 8/2 1:17PM ET, we are experiencing an outage of our proxy layer in gcp-us-central1 region which prevents any requests or responses from being routed to/from all deployments in the region. We do not have any more details at the time. Next update will be provided in 10min.
Posted Feb 08, 2021 - 19:08 UTC
This incident affected: GCP Iowa (us-central1) (Deployment metrics: GCP us-central1).