Some hosts in Azure Australia East zone 3 are unreachable

Incident Report for Elastic Cloud (Public)

Resolved

This incident has been resolved.

Posted Aug 31, 2023 - 02:24 UTC

Monitoring

Update from Microsoft at 23:45 UTC (more details https://azure.status.microsoft/en-us/status):
Impact Statement: Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware. Multiple downstream services were impacted, with targeted communications being distributed via Azure Service Health.

Current Status: Having successfully recovered 99% of storage services and 99% of impacted Virtual Machines, we are actively investigating individual downstream services to confirm their recovery status and mitigate remaining issues. At this stage, we believe most downstream services that are still experiencing impact are the result of dependencies on one of three services with investigations ongoing. Firstly, our Storage team are making progress with the final remaining storage scale unit that is still experiencing isolated issues - we have engaged our onsite datacenter team to support replacing drives as needed. Secondly, our SQL team are working to mitigate one final cluster that is experiencing a capacity issue, due to several Service Fabric nodes that have not fully recovered - we are rebalancing capacity to mitigate. Finally, our Cosmos DB team continue to investigate why some services have not yet recovered fully. While the majority of customers and the majority of services are already mitigated, further updates on these remaining investigations will be provided in 60 minutes, or as events warrant.

On our side, 100% of the affected hosts are back up. We'll continue monitoring the situation and provide an update in the next 2 hours.

Posted Aug 31, 2023 - 00:10 UTC

Update

Update from Microsoft at 22:28 UTC (more details https://azure.status.microsoft/en-us/status):
Impact Statement: Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware. Multiple downstream services were impacted, with targeted communications being distributed via Azure Service Health. Impact to services is limited to Australia East, except for Azure Kubernetes Service (AKS) which has impact in both Australia East and Australia Southeast due to a dependency in the former.

Current Status: With 99% of storage services and 99% of impacted Virtual Machines back online and healthy, we are now supporting individual downstream services to confirm their recovery status. We are aware of one specific storage scale unit that is still experiencing isolated issues, but the majority of customers and services should already be recovered. Beyond this known storage issue, we are investigating which services are still not fully mitigated and why. Further updates will be provided in 60 minutes, or as events warrant.

On our side, 95% of the affected hosts are back up. There are still a handful of affected deployments that are running in a degraded state. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

Posted Aug 30, 2023 - 22:44 UTC

Update

Update from Microsoft at 20:03 UTC (more details https://azure.status.microsoft/en-us/status):

We are in the final phases of restoring core services, and expect that the vast majority of remaining impacted services should be back online in the next hour. After restoring power and stabilizing temperatures, all network infrastructure and 99% of storage services are back online. All premium disk storage has fully recovered, we continue to work towards mitigating the final remaining storage devices. The vast majority of underlying compute services are back online, with more than 99% of Virtual Machines (VMs) that were impacted now back online and healthy.
While many customers and services have already recovered, we are now prioritizing our investigations with the remaining downstream impacted services. We expect that these remaining services should be back online and healthy within the next hour. Further updates will be provided in 60 minutes, or as events warrant.

On our side, 95% of the affected hosts are back up. There are still a handful of affected deployments that are running in a degraded state. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

Posted Aug 30, 2023 - 20:49 UTC

Update

Update from Microsoft (more details https://azure.status.microsoft/en-us/status):

Mitigation efforts are continuing, we have made significant progress in restoring core services but we are not able to provide a mitigation ETA at this time. Power to all hardware has been restored, temperatures in the impacted datacenter have stabilized. All network infrastructure is back online. The majority of storage devices are back online, we are validating issues with a few remaining storage nodes. The majority of underlying compute services are back online, with more than 75% of Virtual Machines that were impacted back online and healthy. While many customers of these core services have seen signs of recovery, we continue to work with downstream impacted services to ensure that they are coming back online as expected.

On our side, we see that 30% of the affected hosts are back up. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

Posted Aug 30, 2023 - 18:47 UTC

Update

Update from Microsoft (more details https://azure.status.microsoft/en-us/status):

Azure have indicated that the vast majority of network infrastructure is back online, and storage device recovery has started. Due to the nature of this issue, storage scale units are expected to require additional recovery efforts to ensure that all resources return in a consistent state. As service recovery continues, some customers may start experiencing signs of recovery.

On our side, none of the hosts affected by the outage have recovered yet. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

Posted Aug 30, 2023 - 17:36 UTC

Update

Failed hosts are limited to Australia East zone 3. To mitigate impact for deployments that have Elasticsearch instances in zone 1 or 2, Kibana and Enterprise Search instances from zone 3 have been restored in zone 1 or 2. Next update will be provided in 2 hours or as soon as we have more to share.

Posted Aug 30, 2023 - 15:23 UTC

Update

Azure has reported that temperatures in the impacted datacenter have stabilized. Azure engineers have started restoring Compute and Storage (more details: https://azure.status.microsoft/en-us/status). Next update will be provided in 2 hours or as soon as we have more to share.

Posted Aug 30, 2023 - 13:51 UTC

Update

Azure engineers are reporting cooling issues in azure-australiaeast and are actively working to mitigate the temperature issues in the datacenter. There is currently no ETA for restoration of the impacted scale units.

Posted Aug 30, 2023 - 12:44 UTC

Identified

Azure has acknowledged an issue and is actively investigating.

Posted Aug 30, 2023 - 11:46 UTC

Investigating

Some hosts in Azure Australia East zone 3 are unreachable. We are observing degraded performance for clusters that have instances allocated in this availability zone.

We're currently investigating the issue and will provide a further update within the next 30 minutes.

Posted Aug 30, 2023 - 11:42 UTC

This incident affected: Azure New South Wales (azure-australiaeast) (Elasticsearch connectivity: Azure azure-australiaeast, APM connectivity: Azure azure-australiaeast, Deployment orchestration (Create/Edit/Restart/Delete): Azure azure-australiaeast, Kibana connectivity: Azure azure-australiaeast, Azure Infrastructure health: azure-australiaeast).