Chaos Engineering

Destroy on Friday: The Big Day - A Chaos Engineering Experiment - Part 2

Jul 23, 2024 By Lex Neva In Honeycomb

In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen. This is part two, where I go over the big day. How did our chaos engineering experiment go? Find out below!

How to balance reliability with other DevOps priorities

Jul 23, 2024 By Gremlin In Gremlin

Reliability efforts do take up some bandwidth, but in the end it's worth it—as our customers find out when their outage costs go down. "Everyone has their own priorities that they're dealing with. Given unlimited time and money, absolutely everyone would want to build the best possible system that is the most secure, performant, resilient, and everything.

Chaos Testing Explained

Jul 19, 2024 By Shanika Wickramasinghe In Splunk

Chaos testing is a part of site reliability engineering (SRE). In chaos testing, we intentionally break things in and around a given application in order to assess how software systems respond to scenarios like network outages, hardware failures, database failures, and server or cluster node failures in the infrastructure.

How to Build Resilience Throughout Your SDLC: Lessons from a Top 10 Bank

Jul 19, 2024 By Gremlin In Gremlin

Are your applications as reliable as you planned? How do you know? The only way to ensure systems are resilient to common failure conditions is to test them, yet many large enterprises struggle with the effort and expense of doing so. In this webinar, Anantha Movva, a former head of SRE and Performance Engineering at one of the top 10 North American banks, will share how he drove Chaos Engineering and resilience testing adoption throughout his organization.

Software reliability and availability is the whole team's problem, not just a few engineers

Jul 18, 2024 By Gremlin In Gremlin

Reliability is everyone's problem—not just the SRE team's. "It's not just the SRE's problem. It's everybody's problem. So the SREs, they can run point and they can help report and help us understand, but we also have to hold the teams accountable. Are the teams investing time in reliability? Are they finding and fixing issues? Are we giving them space? And I think that comes back to, does the business see the benefit and do we have a good way of quantifying the benefit to the business?"—Kolton Andrus, Gremlin CTO.

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment - Part 1

Jul 16, 2024 By Lex Neva In Honeycomb

We recently took a daring step to test and improve the reliability of the Honeycomb service: we abruptly destroyed one third of the infrastructure in our production environment using AWS’s Fault Injection Service. You might be wondering why the heck we did something so drastic. In this post, we’ll go over why we did it and how we made sure that it wouldn’t impact our service.

Testing for expiring TLS and SSL certificates using Gremlin

Jul 16, 2024 By Andre Newman In Gremlin

Encryption is a fundamental part of nearly every modern application, whether you’re storing data, sending data to customers, or sharing data between backend services. Most organizations have a data encryption strategy, and nearly every web page is using HTTPS, thanks to initiatives like Let’s Encrypt. But setting up encryption isn’t a one-time initiative. Over time, the certificates backing modern encryption expire and need to be replaced.
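
As a rough illustration of why expiry needs ongoing attention, here is a minimal Python sketch that reports how many days remain before a server's certificate expires. It uses only the standard library and is not taken from the linked post or from Gremlin's tooling. The function names `days_until_expiry` and `fetch_not_after`, and the 30-day threshold in the usage note, are assumptions made for this example.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days between `now` (epoch seconds) and a certificate's
    notAfter timestamp, e.g. "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the peer certificate's
    notAfter string. With the default context, the handshake itself fails
    if the certificate is already expired or otherwise invalid."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("example.com"))` and raise an alert when the result drops below some threshold, such as 30 days.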

Spend a little time on software reliability now instead of a lot of time later

Jul 11, 2024 By Gremlin In Gremlin

You're going to spend time fixing reliability, but it's your choice whether that happens during an outage or ahead of time, on your schedule, and at lower cost. Which will you choose? "We all know when things go wrong, it cost us a million dollars and it was really bad. Let's have that never happen again. But when we say, I need every engineering team to spend one hour, one day a week on reliability, does everyone lose their mind, or is that a reasonable request? Can we amortize out the cost of that?

How to run fault injection tests on AWS managed services

Jul 11, 2024 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Fully managed SaaS services offer incredible scalability and accessibility, but at a cost: they’re also single points of failure. If your application depends on a SaaS service and the service fails, guess who your customers will blame? We need to design applications to anticipate and work around managed service failures, but how do we do that without having to wait for the service to fail?

How to load-balance across multiple availability zones for improved redundancy

Jul 11, 2024 By Andre Newman In Gremlin

Load balancers are some of the most important load-bearing (pun intended) components in cloud environments. They perform multiple critical tasks: network switching, packet inspection, and of course, routing. Most cloud-based load balancers focus on load balancing within a single zone, but what if you have resources spread across multiple zones?
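
To make the multi-zone question concrete, here is a small Python sketch of round-robin selection that interleaves backends across zones and skips any zone marked as down. It is a toy model under stated assumptions, not how any particular cloud load balancer works. The class `ZoneAwareBalancer`, its methods, and the zone and backend names are all hypothetical.

```python
from typing import Dict, List, Set


class ZoneAwareBalancer:
    """Round-robin over backends, alternating zones and skipping down zones."""

    def __init__(self, backends_by_zone: Dict[str, List[str]]):
        # Remember which zone each backend lives in.
        self._zone_of = {
            backend: zone
            for zone, backends in backends_by_zone.items()
            for backend in backends
        }
        # Interleave backends so consecutive picks land in different zones
        # where possible (first backend of each zone, then second, ...).
        longest = max(len(backends) for backends in backends_by_zone.values())
        self._order = [
            backends[i]
            for i in range(longest)
            for backends in backends_by_zone.values()
            if i < len(backends)
        ]
        self._next = 0
        self._down_zones: Set[str] = set()

    def mark_zone_down(self, zone: str) -> None:
        self._down_zones.add(zone)

    def mark_zone_up(self, zone: str) -> None:
        self._down_zones.discard(zone)

    def pick(self) -> str:
        """Return the next backend whose zone is not marked down."""
        for _ in range(len(self._order)):
            backend = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._zone_of[backend] not in self._down_zones:
                return backend
        raise RuntimeError("no backends available in any healthy zone")
```

Marking a zone down with `mark_zone_down("zone-a")` mimics a zone outage: traffic keeps flowing to the remaining zones, which is the behavior a zone-failure experiment would try to confirm.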
