Chaos Engineering

Introducing Process Exhaustion: How to scale your services without overwhelming your systems

Mar 11, 2024 By Andre Newman In Gremlin

We rarely think about how many processes are running on our systems. Modern CPUs are powerful enough to run thousands of processes concurrently, but at what point do our systems become oversaturated? When you’re running large-scale distributed applications, you might reach this limit sooner than you'd expect. How can you determine what that limit is, and how does that affect the number and complexity of the workloads you deploy?

How to validate memory-intensive workloads scale in the cloud

Mar 6, 2024 By Andre Newman In Gremlin

Memory is a surprisingly difficult thing to get right in cloud environments. The amount of memory (also called RAM, or random-access memory) in a system indirectly determines how many processes it can run and how large those processes can get. You might be able to run a dozen database instances on a single host, but that same host may struggle to run a single large language model.

Your reliability scorecard: How to measure and track service reliability

Mar 5, 2024 By Andre Newman In Gremlin

If your organization asked you to report on the reliability improvements you’ve made over the past 90 days, would you be able to pull up a report? If you’re like many engineers, this question might make you anxious. Reliability is a difficult metric to quantify in a meaningful way, let alone measure.

The case for Fault Injection testing in Production

Feb 27, 2024 By Sam Rossoff In Gremlin

Many organizations that want to introduce Fault Injection as a testing technique start with non-production environments, but don't always revisit that choice as they mature beyond the initial assessment. However, there's a strong case for running these tests in your live systems. It's important to weigh the trade-offs between testing in production and non-production environments, as the choice can have far-reaching impacts on the efficacy and cost of improving the resilience of your software.

How to find and test critical dependencies with Gremlin

Feb 22, 2024 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz: what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams track them in spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

How to use host redundancy to improve service reliability and availability

Feb 22, 2024 By Andre Newman In Gremlin

Cloud computing has made provisioning new servers easy, fast, and relatively cheap. Almost anyone can log into a cloud console, spin up a new server, and deploy an application. And if they need greater uptime, major cloud providers include all kinds of settings, services, and configurations to add fault tolerance and failover. So why is it that many services fail when a single server instance fails?

The two kinds of failure testing

Feb 21, 2024 By Sam Rossoff In Gremlin

Fault injection is a tool, and like all tools, there are a variety of ways operators can employ it, but most of them tend to fall into one of two categories.

10 Most Common Kubernetes Reliability Risks

Feb 14, 2024 By Gavin Cahill In Gremlin

Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen. In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more. And they’re more prevalent than you might think.

How dependency discovery works in Gremlin

Feb 13, 2024 By Andre Newman In Gremlin

Modern applications are rarely created entirely from scratch. Instead, they rely on a framework of pre-existing applications and services, each adding specific features and functionality. These dependencies empower teams to build and deploy applications more efficiently, but they bring their own set of challenges. Tracking, managing, and updating these dependencies is difficult, especially in large, complex applications where dependencies are likely managed by different teams.

Chaos engineering in an Azure environment: Confident enough to try it?

Feb 13, 2024 By Geoffrin Edwin In Site24x7

What could go wrong with your Azure environment? Netflix gave the world two beautiful gifts: a media streaming platform for the general public and a wonderful monkey for the tech community. Enough has been said about the media streaming part, so let's play (or work) with the monkey now. When Netflix let the world know about Chaos Monkey, the tech community took a minute to stand and applaud. Since then, it has become standard practice to unleash intentional chaos just to see how robust our tech stacks really are.
