Chaos Engineering

Introducing Gremlin for AWS

Today, Gremlin is introducing Gremlin for AWS, a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure.

Office Hours: How to test zone redundancy using Gremlin

Part of the Gremlin Office Hours series: a monthly deep dive with Gremlin experts. Zone failures are rare, but they still happen. When an entire zone fails, many of the most common redundancy techniques fail. How do you avoid outages like these, especially if they affect an entire datacenter?

Don't measure reliability with a lagging indicator like downtime or MTTR

Your reliability measurement can't just be a lagging indicator. "How do you know your company is doing well at reliability? A lot of people will just look at how many outages have you had in the last year and how much customer pain have you caused? I think that's one side of the coin. That's the reactive lagging indicator of the health of our system. To really be good at this, we need a way to understand the risks and the sharp points so that we have an idea of what we're getting into."

Reliability is more important than ever: are you ready?

Reliability and resiliency are getting more and more important. Is your organization ready? "Our digital infrastructure is going to be almost as important as our physical infrastructure. And when it fails, it's going to be a big deal. Like when a huge bank has a multi-day outage, when it impacts travel, safety, military, finance, government, those things are going to be much more important than they have been in the past."

The CTO is responsible for reliability and availability

Who's ultimately responsible for reliability? "You need an executive champion that cares about this. And to me, it's the CTO. The CTO is responsible for the quality of the code that you're writing, the quality of the customer experience, the quality of the product. And so, you know, your software doesn't work. The quality is zero. Not half points here. If you can't use it, it doesn't work."

How Nagarro used Gremlin to prevent a cascading failure outage

Check out how Nagarro used Gremlin to help a client prevent a cascading failure before it caused an outage. "Once we had tested a critical software that was doing millions of online transactions on a daily basis. The design was fail safe, providing redundancy on critical services by having multiple instances deployed on different VMs. What we did was we ran a virtual machine terminate test to bring down an instance of that service with the hypothesis that it will recover automatically. Well, the service did recover automatically, but the system saw a cascading failure."

Strategies for migrating to Kubernetes

Migrating to a new platform can often feel like navigating a maze of technical challenges, especially when the platform is as complex as Kubernetes. Kubernetes has a vast number of features designed to help with deploying and managing large applications, but learning how to use it effectively can be just as challenging as moving your workloads over. This doesn’t mean it’s impossible, of course, and there are several strategies for easing this process.

Amazon makes reliability a priority: do you?

Are you really making reliability a priority? Or are you just paying it lip service? "At Amazon, I was part of the retail website. Outages were lost money, lost money was bad. So Amazon cared deeply about this. That was part of it. The other part was it was part of the engineering culture. When I arrived, one of the things I was told was, we expect you to write high quality, performant, efficient, available code. It's just everybody."

Battletesting Coroot with OpenTelemetry Demo and Chaos Mesh

The most effective method for evaluating an observability tool is to introduce a failure intentionally into a fairly complex system, and then observe how quickly the tool detects the root cause. We’ve built Coroot based on the belief that having high-quality telemetry data enables us to automatically pinpoint the root causes for over 80% of outages with precision. But you don’t have to take our word for it—put it to the test yourself!