This is our "Preliminary" PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a "Final" PIR with additional details/learnings.

Between approximately 08:41 UTC on 30 August 2023 and 06:40 UTC on 1 September 2023, customers may have experienced issues accessing or using Azure, Microsoft 365 and Power Platform services. This event was triggered by a utility power sag in the Australia East region, which tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased, so we proactively powered down a small subset of selected compute and storage scale units in an attempt to avoid damage to hardware. Multiple downstream Azure services with dependencies on this infrastructure were also impacted, including Activity Logs & Alerts, API Management, App Service, Application Insights, Arc enabled Kubernetes, Azure API for FHIR, Backup, Batch, Chaos Studio, Container Apps, Container Registry, Cosmos DB, Databricks, Data Explorer, Data Factory, Database for MySQL flexible servers, Database for PostgreSQL flexible servers, Digital Twins, Device Update for IoT Hub, Event Hubs, ExpressRoute, Health Data Services, HDInsight, IoT Central, IoT Hub, Kubernetes Service (AKS), Logic Apps, Log Analytics, Log Search Alerts, Microsoft Sentinel, NetApp Files, Notification Hubs, Purview, Redis Cache, Relay, Search, Service Bus, Service Fabric, SQL Database, Storage, Stream Analytics, and Virtual Machines. A small number of these services experienced prolonged impact, predominantly as a result of dependencies in recovering subsets of Storage, SQL, and/or Cosmos DB services.

Starting at approximately 08:41 UTC on 30 August 2023, a utility power sag in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. The cooling capacity for the two affected data halls consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2) before the voltage dip event. When the event occurred, all five chillers in operation faulted and did not restart because the corresponding pumps did not get the run signal from the chillers. We performed our documented Emergency Operational Procedures (EOP) to attempt to bring the chillers back online, but were not successful. The cooling capacity was reduced in two data halls for a prolonged time, so temperatures continued to rise. At 11:34 UTC, infrastructure thermal warnings from components in the affected data halls directed a shutdown of selected compute, network and storage infrastructure. This shutdown is by design, to protect data durability and infrastructure health. It resulted in a loss of service availability for a subset of this Availability Zone.