Latest Videos

Introducing Gremlin for AWS

Jun 20, 2024 By Gremlin In Gremlin

Introducing Gremlin for AWS, a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. Gremlin for AWS helps teams prevent incidents, monitor and test systems for known causes of failure, and gain visibility into the reliability posture of their applications—with 90% less effort.

Want more software reliability? It starts with leadership

Jun 20, 2024 By Gremlin In Gremlin

If you want to improve reliability, it has to be important from the top down. "As part of the CTO or leadership owning it, they need to tell folks that it's important in the product roadmap, in some of the development schedule, that we spend time on it, that the CEO is the person that holds people accountable, that they review the metrics, that they sit in the outages, that they understand the quality of the software."

Don't measure reliability with a lagging indicator like downtime or MTTR

Jun 13, 2024 By Gremlin In Gremlin

Your reliability measurement can't just be a lagging indicator. "How do you know your company is doing well at reliability? A lot of people will just look at how many outages have you had in the last year and how much customer pain have you caused? I think that's one side of the coin. That's the reactive lagging indicator of the health of our system. To really be good at this, we need a way to understand the risks and the sharp points so that we have an idea of what we're getting into."

Office Hours: How to test zone redundancy using Gremlin

Jun 13, 2024 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Zone failures are rare, but they still happen. When an entire zone fails, many of the most common redundancy techniques fail. How do you avoid outages like these, especially if they affect an entire datacenter?

Reliability should be about empowering teams to make more resilient software

Jun 11, 2024 By Gremlin In Gremlin

Check out how a customer integrated standardized testing into their CI/CD pipeline with minimal lift from individual teams.

Reliability is more important than ever-are you ready?

Jun 6, 2024 By Gremlin In Gremlin

Reliability and resiliency are getting more and more important. Is your organization ready? "Our digital infrastructure is going to be almost as important as our physical infrastructure. And when it fails, it's going to be a big deal. Like when a huge bank has a multi-day outage, when it impacts travel, safety, military, finance, government, those things are going to be much more important than they have been in the past."

The CTO is responsible for reliability and availability

May 30, 2024 By Gremlin In Gremlin

Who's ultimately responsible for reliability? "You need an executive champion that cares about this. And to me, it's the CTO. The CTO is responsible for the quality of the code that you're writing, the quality of the customer experience, the quality of the product. And so, you know, your software doesn't work. The quality is zero. Not half points here. If you can't use it, it doesn't work."

How Nagarro used Gremlin to prevent a cascading failure outage

May 28, 2024 By Gremlin In Gremlin

Check out how Nagarro used Gremlin to help a client prevent a cascading failure before it caused an outage. "Once we had tested a critical software that was doing millions of online transactions on a daily basis. The design was fail safe, providing redundancy on critical services by having multiple instances deployed on different VMs. What we did was we ran a virtual machine terminate test to bring down an instance of that service with the hypothesis that it will recover automatically. Well, the service did recover automatically, but the system saw a cascading failure."

Amazon makes reliability a priority-do you?

May 23, 2024 By Gremlin In Gremlin

Are you really making reliability a priority? Or are you just giving it lip service? "At Amazon, I was part of the retail website. Outages were lost money, lost money was bad. So Amazon cared deeply about this. That was part of it. The other part was it was part of the engineering culture. When I arrived, one of the things I was told was, we expect you to write high quality, performant, efficient, available code. It's just everybody."

How to run Chaos Engineering experiments in your CI/CD pipeline

May 10, 2024 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Ad-hoc Chaos Engineering experiments are great for learning more about how your systems work, but they don’t tell you how your systems behave over time. As new features get deployed, environments change, and regressions get introduced, even the most resilient systems can gain reliability risks. QA and performance testing are already built into CI/CD - why not reliability?
