Operations | Monitoring | ITSM | DevOps | Cloud

Chaos Engineering

How reliability testing and load testing are complementary

How can you tell if your systems are reliable when under load? A common answer is to open your observability dashboards, wait for a high-traffic event (like Black Friday), and cross your fingers. While this approach is certainly effective, it's far from ideal. Without proactive reliability and load testing, we have no idea if a system will hold up to real-world usage patterns, which could mean a production outage at the worst possible time.

How to identify and map service dependencies

Modern applications are a web of interdependent services. As applications grow in size and complexity, and as more engineering teams adopt service-based architectures like microservices, this web becomes deeper and denser. Eventually, keeping track of the interdependencies between services becomes a complex and time-consuming task in and of itself. In addition, if any of these dependencies fails, it can have cascading impacts on the rest of your services and on the application as a whole.

Managing and improving reliability using Gremlin's Reliability Dashboard

Part of a successful reliability program is being able to monitor and review your progress toward improving reliability. Being able to run tests on services is a big part of it, but how can you tell you're making progress if you can only see your latest test results? There should be a way to track improvements or regressions in your reliability testing practice across your organization in a way that's easy to digest. That's where the Reliability Dashboard comes in.

Setting better SLOs using Google's Golden Signals

To many engineers, the idea that you can accurately and comprehensively track your application's user experience using just a few simple metrics might sound far-fetched. Believe it or not, there are four metrics that aim to do just that. They're called the four Golden Signals and should be a core part of your observability and reliability practices.

Chaos testing: Reliability for cloud-native apps

Reliability is a critical concern for software delivery teams. Every second of lackluster performance or service interruption comes with high costs. The consequences can extend beyond just monetary expenses and have a huge impact on a company’s reputation. In a survey conducted in 2022, participants reported that over 60% of digital infrastructure failures resulted in losses of $100,000 or more.

How to Break Stuff with Chaos Engineering and Chaos Mesh

In 2011, a Netflix engineering team introduced the concept of chaos engineering with its release of Chaos Monkey. This was initially an in-house tool developed to orchestrate fault injection that Netflix eventually made open source. However, the reliance of Chaos Monkey on Spinnaker, another Netflix engineering innovation, establishes some limitations.