February 2024

The case for Fault Injection testing in Production

Feb 27, 2024 By Sam Rossoff In Gremlin

Many organizations that are looking to introduce Fault Injection as a testing technique start with non-production environments, but don't always go back and reconsider that choice as they mature beyond initial assessment. However, there's a strong case for running these tests in your live systems. It's important to consider the trade-offs when choosing to test in production or non-production environments, as the choice can have far-reaching impacts on the efficacy and cost of improving the resilience of software.

How to find and test critical dependencies with Gremlin

Feb 22, 2024 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz—what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

How to use host redundancy to improve service reliability and availability

Feb 22, 2024 By Andre Newman In Gremlin

Cloud computing has made provisioning new servers easy, fast, and relatively cheap. Almost anyone can log into a cloud console, spin up a new server, and deploy an application. And if they need greater uptime, major cloud providers include all kinds of settings, services, and configurations to add fault tolerance and failover. So why is it that many services fail when a single server instance fails?

The two kinds of failure testing

Feb 21, 2024 By Sam Rossoff In Gremlin

Fault injection is a tool, and like any tool, operators can employ it in a variety of ways, but most uses fall into one of two categories.

10 Most Common Kubernetes Reliability Risks

Feb 14, 2024 By Gavin Cahill In Gremlin

Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen. In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more. And they’re more prevalent than you might think.

How dependency discovery works in Gremlin

Feb 13, 2024 By Andre Newman In Gremlin

Modern applications are rarely created entirely from scratch. Instead, they rely on a framework of pre-existing applications and services, each adding specific features and functionality. These dependencies empower teams to build and deploy applications more efficiently, but they bring their own set of challenges. Tracking, managing, and updating these dependencies is difficult, especially in large, complex applications where dependencies are likely managed by different teams.

How to make your services zone redundant

Feb 8, 2024 By Andre Newman In Gremlin

In January of 2020, an entire availability zone (AZ) in AWS’ Sydney region suddenly went dark. Multiple facilities lost power, preventing customers from accessing EC2 instances and Elastic Block Store (EBS) volumes. Customers who didn’t have backup infrastructure in another zone had to wait nearly 8 hours before service was restored, and even then, some EBS volumes couldn’t be recovered. Major cloud provider outages are rare, but they happen nonetheless.

Measuring the impact of your reliability work with reports

Feb 6, 2024 By Andre Newman In Gremlin

Improving reliability is important, but how do you prove that your efforts are having an impact? A critical part of reliability work is having the tools to measure and track your progress. Gremlin supports this by providing several built-in reports, which update automatically and are available on demand. This blog post is a quick introduction to Gremlin’s reporting capabilities.

Reducing cloud reliability risks with the AWS Well-Architected Framework

Feb 1, 2024 By Andre Newman In Gremlin

Designing and deploying applications in the cloud can be a labyrinthian exercise. There are dozens of cloud providers, each offering dozens of services, and each of those services has any number of configurations. How are you supposed to architect your systems in a way that gives your customers the best possible experience? AWS recognized this, and in response, they created the AWS Well-Architected Framework (WAF) to guide customers.
