One of the most significant AWS incidents in years occurred on October 20th. Long story short: DNS resolution for DynamoDB was impacted in us-east-1, which created a health event for the entire region. It's the worst incident we've seen in a decade. Disney+, Lyft, McDonald's, the New York Times, Reddit, and many others were lining up to claim their share of the spotlight too. And we've been watching closely, because our product is part of our customers' critical infrastructure. This one graph of the event says it all:

The AWS post-incident report indicates that at 7:48 PM UTC DynamoDB had "increased error rates". But this article isn't about AWS; instead, I want to share exactly how we were still up when AWS was down.
Now you might be thinking: why are you running infra in us-east-1?
And it's true, almost no one should be using us-east-1, unless, well, of course, you are us. And that's because we end up running our infrastructure where our customers are. In theory, practice and theory are the same, but in practice they differ. And if our (or your) customers chose us-east-1 in AWS, then realistically, that means you are also choosing us-east-1 😅.

During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there. And even without a direct dependency on us-east-1, there are critical services in AWS — CloudFront, Certificate Manager, Lambda@Edge, and IAM — that all have their control planes in that region. Attempts to create distributions or roles at that time were also met with significant issues (see the sketch below).

Since there are plenty of articles in the wild covering what actually happened, why it happened, and why it will continue to happen, I don't need to go into that here. Instead, I'm going to dive into exactly what we've built to avoid these exact issues, and what you can do for your applications and platforms as well. In this article, I'll review how we maintain a high SLI to match our SLA reliability commitment even when the infrastructure and services we use don't.
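To make that control-plane coupling concrete, here's a minimal sketch in Python with boto3 (not taken from the full post; the helper functions are purely illustrative). CloudFront only accepts ACM certificates issued in us-east-1, and IAM, while nominally a global service, has its control plane homed in that region, so calls like these are tied to us-east-1 even when your workloads run somewhere else.

```python
import json

import boto3

# CloudFront only accepts ACM certificates issued in us-east-1, so this
# client has to target that region no matter where the workload runs.
acm = boto3.client("acm", region_name="us-east-1")

# IAM is a "global" service, but its control plane is homed in us-east-1;
# role creation was one of the operations struggling during the event.
iam = boto3.client("iam")


def request_edge_certificate(domain: str) -> str:
    """Request a DNS-validated certificate for use with a CloudFront distribution."""
    response = acm.request_certificate(DomainName=domain, ValidationMethod="DNS")
    return response["CertificateArn"]


def create_service_role(role_name: str, trust_policy: dict) -> str:
    """Create an IAM role; during the incident, control-plane calls like this failed."""
    response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    return response["Role"]["Arn"]
```

The only point here is that these particular API calls route through us-east-1, which is why create operations for distributions, certificates, and roles were the ones hit during the event.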
continue reading on authress.io
If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can send them to my personal email. To get new posts, subscribe via the RSS feed.