AWS outage affecting over 1,000 organizations blamed on DNS issue

Here’s what we know about yesterday’s AWS outage that impacted more than 1,000 companies worldwide, including Amazon.com and AWS support operations, as well as major social media apps and numerous financial websites and gaming platforms:

AWS said in an Oct. 20 statement that the outage resulted from a DNS resolution issue for the regional DynamoDB service endpoints in its US-EAST-1 Region in northern Virginia.

A scan of the news coverage makes clear what AWS has not yet told the public: the root cause of the incident.

In other words: what actually caused the DNS resolution issue?

Some publications reported that the DNS issue was caused by an error in a routine update to the DynamoDB API, thus preventing DNS services from finding the correct address for the DynamoDB API.

Was the faulty update the result of a human coding or configuration mistake? Was it a glitch in an automated process? Or was it an error by an AI system?

We still don’t know, and security experts say it’s fairly normal for the industry not to have the answers right away. A root cause analysis can take up to three months, and possibly longer given the complexity of the incident.

“AWS has only said that the outage was a DNS resolution failure to the DynamoDB endpoint in one region, and has not published a full root cause yet,” said Jason Soroko, a senior fellow at Sectigo. “In big resolver fleets, the most common way an update goes wrong is a config or code push that changes record data or resolver behavior in a way that tests did not cover, then caching locks in the mistake and magnifies the blast radius.”
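The caching effect Soroko describes can be illustrated with a toy model. The sketch below is not AWS's resolver design, which has not been published; the class, the hostname `db.example.test`, the addresses, and the 60-second TTL are all hypothetical. It only shows how a record cached during a bad push keeps being served after the upstream data has been corrected.

```python
class CachingResolver:
    """Toy resolver that caches each answer for a fixed TTL (hypothetical)."""

    def __init__(self, upstream, ttl_seconds, clock):
        self.upstream = upstream      # callable: name -> address
        self.ttl = ttl_seconds
        self.clock = clock            # callable returning the current time in seconds
        self.cache = {}               # name -> (address, expires_at)

    def lookup(self, name):
        now = self.clock()
        hit = self.cache.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]             # still fresh: serve the cached answer
        address = self.upstream(name)
        self.cache[name] = (address, now + self.ttl)
        return address


def simulate_bad_push():
    """Return the answers seen before and after a bad record is corrected."""
    now = [0]                                    # fake clock, advanced by hand
    records = {"db.example.test": "192.0.2.99"}  # the bad record from the push
    resolver = CachingResolver(records.__getitem__, 60, lambda: now[0])

    answers = [resolver.lookup("db.example.test")]   # t=0: bad answer is cached

    now[0] = 10
    records["db.example.test"] = "192.0.2.10"        # t=10: upstream is fixed

    now[0] = 30
    answers.append(resolver.lookup("db.example.test"))  # still the cached bad answer

    now[0] = 61
    answers.append(resolver.lookup("db.example.test"))  # TTL expired: refetched
    return answers
```

In this model the fix at the 10-second mark does not reach clients until the cached entry expires, which is why a short-lived mistake can have a longer-lived impact.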

Trey Ford, chief strategy and trust officer at Bugcrowd, explained that 66% to 80% of technology service outages are caused by human error or internal process mistakes — and changes in critical services like DNS are extremely hard to anticipate, test, or contain.

“It’s just the nature of these services,” said Ford. “We all want to know how can we anticipate, detect, and prevent DNS issues, and the answer to that question is rather complicated. I would break this into two categories — systems we internally use, and systems we do not control, but depend upon.”

What matters here is that DNS is a foundational service: it translates human-readable domain names into machine-readable numerical IP addresses. When it fails at a provider that thousands of organizations rely on, as it did at AWS, the outage underscores how interdependent the cloud ecosystem has become.
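For readers who want to see that translation step, here is a minimal sketch using Python's standard `socket` module. The helper name `resolve` is an assumption for illustration, and it returns an empty list when the lookup fails, which is roughly what applications depending on the affected endpoint would have experienced.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the unique IP addresses for hostname, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be translated into an address.
        return []
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]         # first element is the IP address string
        if address not in addresses:
            addresses.append(address)
    return addresses


# Example call (needs network access; the hostname is illustrative):
# resolve("dynamodb.us-east-1.amazonaws.com")
```

An application that gets an empty result here has no address to connect to, even if the service behind the name is otherwise healthy.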

“The disruption, which ranged from delaying thousands of flights to crippling financial trading platforms, demonstrated AWS's critical role in the modern internet and the fragility of global connectivity,” said Damon Small, a board member at Xcape, Inc.

Small said yesterday's AWS incident highlighted how a small configuration error can quickly escalate into an economic crisis, emphasizing the importance of contingency planning. As reliance on the cloud grows, Small said these single points of failure will have a greater impact, validating the need for cloud diversification strategies.

“The frustrating silence surrounding the AWS DNS outage stems from the fact that it wasn't caused by a hack or AI,” said Small. “Instead, it was a complex technical failure: a small update to DynamoDB's DNS resolution cascaded, leading to a failure in an internal EC2 subsystem that monitors network load balancer health.”

Although AWS hasn't attributed blame to specific individuals, Small said the outage highlights the risks associated with engineering complexity and configuration, issues potentially exacerbated by Amazon's high employee turnover and the reality that the company has laid off more than 27,000 people since 2022.

"Layoffs can raise the odds that a routine change turns into an incident since they drain reviewers and on-call depth and they take tribal knowledge out of the room,” said Sectigo’s Soroko. “But headcount alone does not make the bit level changes on a resolver, so at most it’s an amplifier rather than a trigger."

On the other hand, Cybernews reported that Justin Garrison, a senior engineer, left AWS at the end of 2023 and sharply criticized the cloud provider in a blog post, saying the company was already seeing an increase in large-scale events and predicting major outages in the near future.

Garrison also said people were leaving AWS in droves, pointing out that many became disenchanted when Amazon's return-to-office policy was enforced after the pandemic. While Amazon at first laid off people mostly in its warehouses, Garrison reported that the cuts eventually reached higher-paid senior engineers, many of whom earn $400,000 to $800,000.