Chaos Engineering

Why Reliability Engineering Matters: an Analysis of Amazon's Dec 2021 US-East-1 Region Outage

In the field of Chaos Theory, there’s a concept called the Synchronization of Chaos—disparate systems filled with randomness will influence the disorder in other systems when coupled together. From a theoretical perspective, these influences can be surprising. It’s difficult to understand exactly how a butterfly flapping its wings could lead to a devastating tornado. But we often see the influences of seemingly unconnected systems play out in real life.

Podcast: Break Things on Purpose | Carissa Morrow: Learning to be Resilient

Being new in tech can be intimidating! Thankfully, folks like Carissa Morrow are shining examples of how to come into tech from the ground up. Carissa began with a career shift and just started coding, went through the Boise Codeworks bootcamp, and made the jump to tech. Carissa talks about the resilience it took in her early days, and how those experiences reinforced her attitude toward continual learning.

Podcast: Break Things on Purpose | Gunnar Grosch: From user to hero to advocate

Reliability and serverless are at the forefront of today’s conversation. For this episode Gunnar Grosch, Senior Developer Advocate at AWS, is here to talk about Chaos Engineering, AWS Serverless, and the work that AWS is doing when it comes to reliability.

If you're adopting Kubernetes, you need Chaos Engineering

When Ticketmaster started their Kubernetes migration, they had to address a huge problem: whenever ticket sales opened for a popular event, as many as 150 million visitors flooded their website, effectively causing distributed denial of service (DDoS) attacks. With new events happening every 20 minutes and $7.6 billion in revenue at stake, outages could mean hundreds of thousands in lost sales.

Getting started with Time Travel attacks

It's the middle of the night when your phone goes off. You rub your eyes and unlock the screen to see a SEV 1 alert from your incident management tool. The application is down, multiple cloud server instances are offline, and the remaining instances are being overwhelmed by the sudden increase in demand. You jump out of bed and start trying to troubleshoot. You log into your cloud provider and try to provision systems manually, only to find out you can't.

Podcast: Break Things on Purpose | Unpopular Opinions

Time for a bit of a review! Join Jason as he looks back on previous guests who have shared opinions that range from the idiosyncratic to the downright unpopular. Pulling from a handful of “Breaking Things” interviews, Jason covers everything from death to VPNs to the validity of “AI Ops.” Check out the litany!