Complete unavailability of gcp-us-central1
Incident Report for ESS (Public)
Postmortem
Incident Date 2021-02-08
Reference Numbers CLOUD:74936
Incident Start Date (UTC) 2021-02-08T18:51
Incident End Date (UTC) 2021-02-08T20:40
Incident Duration 1:48:32
Impact Start Date (UTC) 2021-02-08T18:31
Impact End Date (UTC) 2021-02-08T19:17
Impact Duration 46 minutes

Summary

Impact

A complete loss of network connectivity to customer deployments in the GCP us-central1 (Iowa) region for 46 minutes. All traffic to and from deployments in the region is routed through a set of proxy instances (the “proxy layer” or “proxies”), which became unavailable.

Root Cause(s)

  1. During a manual scale-down of six proxies in GCP us-central1 (two proxies in each of three configured availability zones), a bug in the GCP User Console disassociated all proxy Instance Groups from the load balancer Target Pool instead of removing the desired proxies from the Instance Groups. This was the root cause of impact (the Instance Group / Target Pool association is sketched after this list).
  2. The impact duration was longer than anticipated for the following reasons:

    1. Time-to-detect was 6 minutes:

      1. SLA monitoring triggered an alert, but it did not page the on-call team for awareness.
      2. Alerts for our monitoring clusters being unavailable indirectly brought attention to the incident.
    2. Time-to-respond was 14 minutes because the indirect alerts (see above) initially obscured the scope of customer impact.

    3. Time-to-remediate was 26 minutes:

      1. Six (6) minutes: Our logging clusters were not reachable, since they depend on the impacted proxy layer, which made it more difficult to triage the problem.
      2. Seventeen (17) minutes: Initial attempts to add more proxies failed because the Instance Groups were no longer connected to the Target Pool. Additional discovery was needed after that effort failed to resolve the problem.
      3. Three (3) minutes: Reassociating the proxy Instance Groups with the Target Pool completed and resolved the problem.
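
For readers unfamiliar with the GCP primitives involved: assuming the proxies run in managed Instance Groups, each zonal group carries the list of Target Pools its instances serve behind, and it is this association that the console bug removed. The sketch below is illustrative only, with hypothetical resource names and assumed zones, and shows how that association can be inspected from the gcloud CLI.

    # Illustrative sketch only: inspect the Instance Group -> Target Pool
    # association for each zonal proxy Instance Group. Resource names and the
    # zone list are assumptions; requires an authenticated gcloud CLI.
    import json
    import subprocess

    ZONES = ["us-central1-a", "us-central1-b", "us-central1-c"]  # assumed zones

    for zone in ZONES:
        instance_group = f"proxy-ig-{zone}"  # hypothetical name
        out = subprocess.run(
            ["gcloud", "compute", "instance-groups", "managed", "describe",
             instance_group, f"--zone={zone}", "--format=json"],
            check=True, capture_output=True, text=True,
        ).stdout
        target_pools = json.loads(out).get("targetPools", [])
        # During the incident this list would have been empty for every group,
        # leaving the Target Pool with no proxies to route traffic to.
        print(f"{instance_group}: {len(target_pools)} associated target pool(s)")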

Resolution Action Items

Reassociating the proxy Instance Groups with the Target Pool remediated the problem. It took three (3) minutes before enough proxies were reintroduced to successfully handle the traffic.
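
As an illustration of what that reassociation looks like at the GCP level, the sketch below reattaches each zonal Instance Group to the Target Pool using the gcloud CLI. Resource names and zones are hypothetical, and managed Instance Groups are assumed; this is not the exact procedure Elastic ran.

    # Illustrative sketch only: reassociate each zonal proxy Instance Group with
    # the load balancer Target Pool. Names and zones are assumptions; requires an
    # authenticated gcloud CLI with permission on the project.
    import subprocess

    TARGET_POOL = "proxy-target-pool"                            # hypothetical name
    ZONES = ["us-central1-a", "us-central1-b", "us-central1-c"]  # assumed zones

    for zone in ZONES:
        instance_group = f"proxy-ig-{zone}"                      # hypothetical name
        subprocess.run(
            ["gcloud", "compute", "instance-groups", "managed", "set-target-pools",
             instance_group, f"--target-pools={TARGET_POOL}", f"--zone={zone}"],
            check=True,
        )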

Timeline

UTC Date & Time Event
2021-02-08T18:24 UTC Change made by operator using the GCP User Console to drain six (6) proxies from Target Pool in us-central1 region
2021-02-08T18:31 UTC Impact starts according to SLA dashboards
2021-02-08T18:34 UTC GCP logs show completed removal of all Instance Groups from Target Pool
2021-02-08T18:37 UTC SLA alert triggered regarding loss of connectivity in GCP us-central1 region
2021-02-08T18:51 UTC Incident created for loss of connectivity to GCP us-central1 region
2021-02-08T18:58 UTC Investigation by Elastic engineers discovers proxy Instance Groups were disassociated from Target Pool, resulting in no available hosts to handle traffic
2021-02-08T19:03 UTC Elastic engineers begin adding new proxies to all zones to resolve problem (this solution does not work)
2021-02-08T19:08 UTC Incident posted on Elastic Cloud Status
2021-02-08T19:14 UTC Elastic engineers reassociate the Instance Groups with the Target Pool (this solution works)
2021-02-08T19:17 UTC Enough proxy instances reintroduced to Target Pool to handle expected traffic
2021-02-08T21:29 UTC After replicating the behavior in a non-production environment, Elastic engineers opened a ticket with the cloud provider regarding the change that was made and the unexpected result
2021-02-08T22:20 UTC Incident marked as resolved on Elastic Cloud Status
2021-02-10T23:52 UTC Cloud provider confirmed the observed behavior of disassociating Instance Groups from the load balancer Target Pool when attempting to remove individual instances was a bug in the GCP User Console

Action Items

Completed

  • Reduce time-to-respond: Ensure SLA alerts page on-call engineers when triggered.

Work in Progress

  • Mitigate failure: A support ticket was filed with GCP regarding the User Console bug. It has been acknowledged and reproduced by GCP. There is currently no ETA for when it will be resolved.

Future Work

  • Mitigate failure: Implement process, tooling, and documentation for ad-hoc scale-down of the proxy layer that avoids the GCP User Console bug (a console-free sketch follows this list)
  • Mitigate impact: Improve the proxy layer autoscaling to reduce the need for ad-hoc scaling
  • Reduce time-to-detect: Improve alerting during total loss of proxy connectivity in a region
  • Reduce time-to-remediate: Improve availability of monitoring clusters during incidents involving proxy layer
  • Improve communication: Improve automation of Elastic Cloud Status updates during incidents to reduce friction and time for incident updates
  • Improve communication: Update incident management process to communicate via Elastic Cloud Status when an RCA is expected to be posted for qualifying incidents
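
As a possible starting point for that tooling, the sketch below performs the scale-down directly against the GCP API via the gcloud CLI, bypassing the User Console, and then verifies that the Target Pool still has instances attached so a disassociation like the one in this incident would be caught immediately. Resource, zone, and instance names are hypothetical, and managed Instance Groups are assumed; the eventual Elastic tooling may look quite different.

    # Illustrative sketch only: console-free ad-hoc scale-down of the proxy layer.
    # Specific proxies are removed through their managed Instance Group, and the
    # Target Pool membership is checked afterwards. Resource names, zones, and
    # instance names are assumptions; requires an authenticated gcloud CLI.
    import json
    import subprocess

    TARGET_POOL = "proxy-target-pool"  # hypothetical name
    REGION = "us-central1"

    def delete_from_instance_group(instance_group, zone, instances):
        """Remove and delete specific instances from a zonal managed Instance Group."""
        subprocess.run(
            ["gcloud", "compute", "instance-groups", "managed", "delete-instances",
             instance_group, "--instances=" + ",".join(instances), f"--zone={zone}"],
            check=True,
        )

    def target_pool_member_count():
        """Count the instances currently attached to the Target Pool."""
        out = subprocess.run(
            ["gcloud", "compute", "target-pools", "describe", TARGET_POOL,
             f"--region={REGION}", "--format=json"],
            check=True, capture_output=True, text=True,
        ).stdout
        return len(json.loads(out).get("instances", []))

    if __name__ == "__main__":
        # Example: remove two proxies from one zone's group (hypothetical names).
        delete_from_instance_group("proxy-ig-us-central1-a", "us-central1-a",
                                   ["proxy-001", "proxy-002"])
        if target_pool_member_count() == 0:
            raise SystemExit("Target Pool has no members left; halting scale-down")
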
Posted Feb 17, 2021 - 22:11 UTC

Resolved
This incident has been resolved.
Posted Feb 08, 2021 - 22:20 UTC
Update
We have discovered and fixed the issue with the GCP load balancer configuration that led to all proxy instances being disassociated from the load balancer, and a complete outage of the region. As of 02/08 2:16 ET, the issue has been resolved and the region should be accessible again.
Posted Feb 08, 2021 - 19:19 UTC
Investigating
Starting 8/2 1:17 PM ET, we are experiencing an outage of our proxy layer in the gcp-us-central1 region, which prevents any requests or responses from being routed to/from all deployments in the region. We do not have any more details at this time. The next update will be provided in 10 minutes.
Posted Feb 08, 2021 - 19:08 UTC
This incident affected: GCP Iowa (us-central1) (Deployment metrics: GCP us-central1).