
How to find Kubernetes reliability risks with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.
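To give a feel for the checks involved, here is a minimal sketch of two of these risk classes, written against the official `kubernetes` Python client. Gremlin's Detected Risks does this (and more) automatically; this is only an illustration of what the manual version of such a scan looks like.

```python
# Minimal sketch: manually scan for two of the risk classes that
# Gremlin's Detected Risks surfaces automatically. Assumes the official
# `kubernetes` Python client (pip install kubernetes) and cluster access.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces().items:
    name = f"{dep.metadata.namespace}/{dep.metadata.name}"

    # Risk: single point of failure, only one replica requested.
    if (dep.spec.replicas or 1) < 2:
        print(f"{name}: single replica (no redundancy)")

    for c in dep.spec.template.spec.containers:
        # Risk: missing resource definitions, no CPU/memory requests.
        if c.resources is None or not c.resources.requests:
            print(f"{name}/{c.name}: no resource requests defined")
```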

Three key facts about serverless reliability

Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability. In this blog, we’ll look at three important facts about serverless reliability that teams often overlook.
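As one concrete illustration of a detail teams often overlook: the platform manages the servers, but your function code still owns timeouts and retries on downstream calls. The sketch below is illustrative only; the endpoint URL is made up, and the retry settings are examples rather than recommendations.

```python
# Sketch: even on serverless, your code owns reliability details like
# timeouts and retries on downstream calls. Uses `requests`; the URL
# below is a hypothetical placeholder.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures, but cap attempts so the function fails fast
# instead of hanging until the platform kills it.
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=2, backoff_factor=0.5, status_forcelist=[502, 503, 504])))

def handler(event, context):
    try:
        resp = session.get("https://inventory.example.com/items",
                           timeout=3)  # well under the function timeout
        resp.raise_for_status()
        return {"statusCode": 200, "body": resp.text}
    except requests.RequestException:
        # Degrade gracefully instead of surfacing a raw error.
        return {"statusCode": 503, "body": "inventory temporarily unavailable"}
```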

Ensuring your AI systems can scale to meet demand

The scale of traffic handled by AI systems is hard to overstate. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources jumped by 1,200% since July 2024. While demand for AI-powered workloads is steadily increasing overall, traffic to individual AI providers is much more unpredictable. User demand spikes and wanes unexpectedly, but as with any service, users expect yours to be available and responsive at all times.
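One simple pattern for surviving unpredictable spikes is to shed excess load explicitly rather than let requests pile up behind the model. A minimal sketch, where `run_inference` is a hypothetical stand-in for your actual model call:

```python
# Sketch: cap concurrent inference requests and reject the overflow
# quickly, so spikes degrade service predictably instead of taking it
# down. `run_inference` is a hypothetical stand-in for the real model call.
import threading

MAX_CONCURRENT = 32
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def run_inference(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder for the real model call

def handle_request(prompt: str) -> tuple[int, str]:
    if not _slots.acquire(blocking=False):
        # Fast rejection: the client gets an honest 429 instead of a timeout.
        return 429, "over capacity, retry with backoff"
    try:
        return 200, run_inference(prompt)
    finally:
        _slots.release()
```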

How to keep track of what's running in your Gremlin team

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Reliability testing is ongoing, and tracking that work can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on—unless you use Gremlin.
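The same activity is also reachable programmatically through Gremlin's REST API. The sketch below is illustrative only: the endpoint path, auth scheme, and response fields are assumptions to verify against Gremlin's current API documentation.

```python
# Illustrative sketch only: list recent attacks via Gremlin's REST API.
# The endpoint path, the "Key" auth scheme, and the response fields are
# assumptions; confirm all of them against Gremlin's API docs.
import os
import requests

API_KEY = os.environ["GREMLIN_API_KEY"]

resp = requests.get(
    "https://api.gremlin.com/v1/attacks",          # assumed endpoint
    headers={"Authorization": f"Key {API_KEY}"},   # assumed auth scheme
    timeout=10,
)
resp.raise_for_status()
for attack in resp.json():
    print(attack.get("guid"), attack.get("stage"))  # assumed field names
```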

How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service of their payment applications could failover correctly between regions in case of an outage. But there was one snag: the service was built using AWS Lambda. This meant infrastructure-focused tests would have trouble replicating the failure conditions necessary to test the failover due to Lambda’s serverless model.
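Failure Flags solves this by injecting faults from inside the function's own code rather than its infrastructure. The sketch below shows the general shape of that approach; the package and class names follow Gremlin's Failure Flags SDK but should be verified against its docs, and the payment logic is a hypothetical stand-in.

```python
# Illustrative sketch: a Failure Flag inside a Lambda handler lets Gremlin
# inject faults (latency, errors) at this exact point, with no access to
# the underlying infrastructure required. SDK names per Gremlin's Failure
# Flags docs (verify there); charge_payment is a hypothetical stand-in.
from failureflags import FailureFlag

def handler(event, context):
    # If an experiment targets this flag, invoke() applies the configured
    # fault, e.g. raising an exception to exercise the regional failover.
    region = context.invoked_function_arn.split(":")[3]
    FailureFlag("charge-payment", {"region": region}).invoke()

    return charge_payment(event)

def charge_payment(event):
    return {"statusCode": 200, "body": "charged"}  # placeholder logic
```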

Simulating artificial intelligence service outages with Gremlin

The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button. As a result, more applications have been able to use AI services for data analysis, content generation, media production, and much more.
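That convenience creates a dependency worth testing: what happens to your application when the AI provider is slow or down? A minimal, hypothetical sketch of the fallback path such an outage simulation should exercise (the provider URL is deliberately unreachable; in practice you might inject latency or errors with a tool like Gremlin instead):

```python
# Sketch: the fallback path an AIaaS outage test should exercise. The
# outage is simulated here by pointing at an unreachable host; names and
# URLs are illustrative.
import requests

PRIMARY_URL = "https://api.example-ai.invalid/v1/generate"  # unreachable on purpose

def generate(prompt: str) -> str:
    try:
        resp = requests.post(PRIMARY_URL, json={"prompt": prompt}, timeout=2)
        resp.raise_for_status()
        return resp.json()["text"]
    except requests.RequestException:
        # Degraded mode: a canned answer beats an error page.
        return "Our AI assistant is temporarily unavailable. Please try again."

print(generate("summarize my order history"))
```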

Improving Resilience for GenAI Workloads on AWS

GenAI can do incredible things, but like any technology, its success depends on how we implement and use it. Without proper implementation, GenAI failures can pose significant risks to your organization's reputation and customer trust, leading to real financial impact. And like any other application, regulatory rules, SLAs, and reliability standards still apply to GenAI. With more companies integrating GenAI into their systems and products, it’s essential to make sure GenAI workloads and applications are highly available to deliver an exceptional user experience.
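One common availability pattern on AWS is regional failover for model invocations. Here's a minimal sketch using boto3 and Amazon Bedrock; the model ID and region list are examples, not recommendations.

```python
# Sketch: retry the same Bedrock model in a secondary region when the
# primary region fails. Assumes boto3 and Bedrock access in both regions;
# the model ID and regions are examples.
import json
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # example model
REGIONS = ["us-east-1", "us-west-2"]                 # primary, then fallback

def invoke_with_failover(body: dict) -> dict:
    last_err = None
    for region in REGIONS:
        try:
            client = boto3.client("bedrock-runtime", region_name=region)
            resp = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
            return json.loads(resp["body"].read())
        except (ClientError, EndpointConnectionError) as err:
            last_err = err  # primary region failed, try the next one
    raise last_err
```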

Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment. Fortunately, reliability testing can help you catch these errors before they cause outages.
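The "conflicting timeouts" slip is easy to picture, and just as easy to test for. Below is a deliberately misconfigured example with a trivial guardrail test that catches it; the values and names are illustrative.

```python
# Sketch: a "conflicting timeouts" mistake a coding agent (or a human)
# can leave behind, plus a trivial test that catches it. Values are
# illustrative; this test fails on purpose, flagging the bug.
GATEWAY_TIMEOUT_S = 5    # how long the gateway waits on the backend
BACKEND_TIMEOUT_S = 10   # how long the backend waits on its database

def test_timeouts_nest_correctly():
    # If the backend may legitimately take 10s, a 5s gateway timeout
    # guarantees spurious failures under load: each caller should wait
    # longer than everything beneath it.
    assert GATEWAY_TIMEOUT_S > BACKEND_TIMEOUT_S, (
        "gateway gives up before the backend can finish"
    )
```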

How to Build Observability into Chaos Engineering

If you've ever deployed a distributed system at scale, you know things break—often in ways you never expected. That’s where Chaos Engineering comes in. But running chaos experiments without robust observability is like debugging blindfolded. This guide will walk you through how observability empowers Chaos Engineering, ensuring that your experiments yield meaningful insights instead of just causing chaos for chaos’ sake.
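As a taste of what that pairing looks like in code, here is a minimal sketch that wraps an experiment window with explicit measurement, using `prometheus_client` as a stand-in for your metrics stack; the fault-injection and request functions are hypothetical placeholders.

```python
# Sketch: wrap a chaos experiment window with measurement so the
# experiment yields a verdict, not just chaos. Uses prometheus_client
# (pip install prometheus-client); inject_fault and the request body are
# hypothetical stand-ins for your fault injector and real traffic.
import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram("request_latency_seconds",
                            "End-to-end request latency")

def measured_request():
    with REQUEST_LATENCY.time():
        time.sleep(0.05)  # stand-in for a real request

def run_experiment_window(inject_fault, duration_s=30):
    start = time.time()
    inject_fault()                 # e.g., trigger a Gremlin attack
    while time.time() - start < duration_s:
        measured_request()         # keep observing while the fault runs
    # Afterward, compare the recorded distribution against your
    # steady-state SLO (in practice, by querying Prometheus/Grafana).
```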