Introduction
On October 20, 2025, Amazon Web Services (AWS) experienced a significant global outage that disrupted thousands of websites and applications across multiple sectors. The incident, concentrated in the US-EAST-1 (Northern Virginia) region, began causing widespread disruptions in the early morning hours (ET) and was fully mitigated by mid-afternoon.
Timeline of Events
- 12:11 a.m. ET: AWS reported elevated error rates affecting 14 services, including EC2, S3, DynamoDB, IAM, and Route 53 in the US-EAST-1 region.
- 2:00 a.m. ET: AWS identified DNS resolution issues affecting the DynamoDB API endpoint, disrupting dependent services like Lambda, CloudTrail, and API Gateway.
- 6:35 a.m. ET (3:35 a.m. PT): AWS applied mitigations; most services began recovering.
- Late morning: Residual slowdowns persisted for Bedrock and RDS users; full recovery was observed around 12:30 p.m. ET.
Root Cause Analysis
AWS confirmed the outage stemmed from a DNS resolution failure linked to DynamoDB’s endpoint in US-EAST-1. This caused cascading API request failures across services with global dependencies on that region (e.g., IAM and DynamoDB Global Tables). AWS engineers indicated the incident was not security-related but rather an internal infrastructure misconfiguration that disrupted DNS propagation.
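To make the failure mode concrete, the short Python sketch below simply probes whether regional DynamoDB endpoints resolve at all. The hostnames are AWS’s public regional endpoints, but the probe itself is only an illustration of how a DNS fault surfaces to clients, not AWS’s internal tooling.

```python
import socket

# Illustrative probe: if the regional DynamoDB endpoint stops resolving,
# every SDK call that targets it fails before a single packet is sent,
# which is how a DNS fault fans out into broad API errors.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",   # the endpoint implicated in the outage
    "dynamodb.us-west-2.amazonaws.com",   # a healthy region for comparison
]

def resolves(hostname: str) -> bool:
    """Return True if the hostname currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

if __name__ == "__main__":
    for host in ENDPOINTS:
        status = "OK" if resolves(host) else "DNS RESOLUTION FAILED"
        print(f"{host}: {status}")
```

Once resolution fails, every SDK call addressed to that hostname errors out before it ever reaches the service, which is why so many dependent services degraded at once.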
Affected Services and Regions
Beyond EC2 and S3, the disruption rippled through:
- AWS Bedrock (AI model endpoints temporarily unreachable)
- RDS and CloudFront (intermittent access and replication failures)
- Route 53 (DNS propagation delays)
- IAM (authorization bottlenecks)
While primarily centered in US-EAST-1, user reports spiked globally, particularly in Europe, Japan, and Australia, with over 6.5 million outage reports recorded.
Industry Impact
- Banking & Fintech: Coinbase and Robinhood trading platforms temporarily halted transactions.
- E-commerce: Amazon, Shopify, and Etsy order processing systems disrupted; checkout API latencies exceeded 3 seconds on average.
- SaaS Platforms: Canva, Notion, and Zoom experienced degraded performance.
- AI Workloads: Bedrock-based applications (including Perplexity AI) saw slow inferencing and retraining jobs paused.
- Government Workloads: GovCloud dependencies reported short-lived API throttling but recovered by 10 a.m. ET.
Monitoring and Provider Responses
- AWS Statement: Confirmed recovery was in progress, citing the DNS failure affecting DynamoDB as the root cause and reporting full restoration by the afternoon.
- Datadog & Dynatrace: Both noted incomplete telemetry during the event due to their own dependence on AWS.
- Cloudflare: Reported normal edge operations, confirming the fault was internal to AWS DNS infrastructure.
Security and Cascading Effects
No evidence of cyberattack or data corruption was found. However, temporary IAM authorization mismatches and API Gateway rate throttling disrupted authentication services for some customers. Bedrock latency and Lambda backlogs triggered delayed automation pipelines across AI-driven workflows.
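For teams hardening against this class of cascading effect, a common client-side measure is to retry throttled or lagging calls with exponential backoff and jitter. The sketch below is a minimal, generic wrapper; the retryable error codes, attempt limit, and base delay are assumptions to tune per workload, not AWS-prescribed values.

```python
import random
import time
from botocore.exceptions import ClientError

# Error codes that typically indicate throttling or transient lag rather than
# a permanent failure; which codes to retry on is an application decision.
TRANSIENT_CODES = {
    "ThrottlingException",
    "TooManyRequestsException",
    "ProvisionedThroughputExceededException",
}

def call_with_backoff(call, max_attempts=5, base_delay=0.5):
    """Retry an AWS SDK request with exponential backoff and full jitter.

    `call` is any zero-argument callable wrapping a boto3 request
    (an illustrative assumption; wire in your own client call).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in TRANSIENT_CODES or attempt == max_attempts:
                raise
            # Sleep a random fraction of the exponential window before retrying.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Capping attempts matters: unbounded retries during a regional event amplify the very backlog they are trying to ride out.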
Historical Comparison
Date | Duration | Root Cause | Region | Impact
- December 7, 2021 | ~7 hours | Network congestion | US-EAST-1 | Major e-commerce and media disruption
- June 13, 2023 | ~2 hours | Authentication service overload | US-EAST-2 | Moderate SaaS and IoT impact
- October 20, 2025 | ~10 hours (intermittent) | DNS resolution failure | US-EAST-1 | Global: banking, AI, SaaS, and retail
Expert Commentary & Implications
Industry analysts argue this outage underscores single-region dependency risks and the fragile coupling between DNS and modern distributed applications. Gartner analysts noted that enterprises using multicloud redundancy or DNS-level failover services such as Cloudflare Load Balancing or Azure Traffic Manager maintained higher uptime. CIOs and CTOs are advised to:
- Architect failover for critical APIs outside a single AWS region (see the sketch after this list).
- Deploy independent DNS and monitoring layers.
- Stress-test IAM and event-driven Lambda automations against service lag conditions.
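As a minimal sketch of the first recommendation, the code below reads from a DynamoDB Global Table and falls back to a second region when the primary endpoint fails or cannot be reached. The table name, key schema, and region pair are hypothetical, and it assumes the table is already replicated to both regions.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical Global Table replicated to both regions; names are placeholders.
TABLE = "orders"
REGIONS = ["us-east-1", "us-west-2"]  # primary first, then failover

def get_order(order_id: str):
    """Read an item from the first region that answers, falling back on failure."""
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            resp = client.get_item(
                TableName=TABLE,
                Key={"order_id": {"S": order_id}},
            )
            return resp.get("Item")
        except (ClientError, EndpointConnectionError) as err:
            last_error = err  # try the next region before giving up
    raise last_error

if __name__ == "__main__":
    print(get_order("12345"))
```

Cross-region replication in Global Tables is eventually consistent, so a fallback read can return slightly stale data; that trade-off should be an explicit design decision rather than a surprise during an outage.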
Key Takeaways for Executives
- The October 2025 outage was a DNS-level failure, not a cyberattack.
- US-EAST-1 single-region dependency continues to be an enterprise continuity concern.
- Companies with multicloud resilience experienced minimal downtime.
- AWS recovery visibility is improving, but communication transparency remains a critical customer expectation.
Overall Impact: Temporary but severe — approximately 20% of global internet traffic was degraded or interrupted for up to 8 hours. AWS has since stated it is implementing “enhanced DNS partitioning and inter-region failover automation” to prevent recurrence.
About the Author

Rejith Krishnan
Rejith Krishnan is the Founder and CEO of lowtouch.ai, a platform dedicated to empowering enterprises with private, no-code AI agents. With expertise in Site Reliability Engineering (SRE), Kubernetes, and AI systems architecture, he is passionate about simplifying the adoption of AI-driven automation to transform business operations.
Rejith specializes in deploying Large Language Models (LLMs) and building intelligent agents that automate workflows, enhance customer experiences, and optimize IT processes, all while ensuring data privacy and security. His mission is to help businesses unlock the full potential of enterprise AI with seamless, scalable, and secure solutions that fit their unique needs.