Operations | Monitoring | ITSM | DevOps | Cloud

Gremlin

How Gremlin's reliability score works

In order to make reliability improvements tangible, there needs to be a way to quantify and track the reliability of systems and services in a meaningful way. This "reliability score" should indicate at a glance how likely a service is to withstand real-world causes of failure without having to wait for an incident to happen first. Gremlin's upcoming feature allows you to do just that.

Chaos Engineering Tools: Build vs Buy

Chaos Engineering, where engineers intentionally inject failure to test the reliability of their systems, is becoming a regular practice for companies who value uptime and availability. As cloud-based systems have grown more complex, Chaos Engineering has become a critical part of the software testing and release process to uncover surprise dependencies, fix problems before they become 3am outages, and bake reliability into every feature.

Podcast: Break Things on Purpose | Developer Advocacy and Innersource with Aaron Clark

In this episode, Jason chats with Aaron Clark, Director of Developer Advocacy at the Royal Bank of Canada. Aaron shares what it was like starting out as a developer at RBC and working in early cloud development, and then transitioning to his role as a developer advocate. Jason and Aaron talk about the value applying open source principles within organizations, or “innersource.” Their time ends with a discussion on continuing education and how to keep learning.

Gartner: tips for improving reliability

In their report titled “IT Resilience — 7 Tips for Improving Reliability, Tolerability and Disaster Recovery”, Gartner presents seven strategies for improving the resilience posture of your critical systems. These recommendations range from how to get started, to identifying IT hazards and risks to reliability, to capturing metrics and translating them into business value. In this blog, we’ll take a high-level look at the report and summarize some of its key findings.

Podcast: Break Things on Purpose | KubeCon, Kindness, and Legos with Michael Chenetz

In this episode, we chat with Cisco’s head of developer content, community, and events, Michael Chenetz. We discuss everything from KubeCon to kindness and Legos! Michael delves into some of the main themes he heard from creators at KubeCon, and we discuss methods for increasing adoption of new concepts in your organization. We have a conversation about attending live conferences, COVID protocol, and COVID shaming, and then we talk about how Legos can be used in talks to demonstrate concepts.

Podcast: Break Things on Purpose | Dan Isla: Astronomical Reliability

It’s time to shoot for the stars with Dan Isla, VP of Product at itopia, to talk about everything from astronomical importance of reliability to time zones on Mars. Dan’s trajectory has been a propulsion of jobs bordering on the science fiction, with a history at NASA, modernizing cloud computing for them, and loads more. Dan discusses the finite room for risk and failure in space travel with an anecdote from his work on Curiosity.

How Gremlin runs a GameDay

You might be familiar with GameDays at this point. From watching our Introduction to GameDay webinar, viewing our Demo video, and reading our tutorial, you’ve probably learned that GameDays were created with the goal of increasing reliability by purposely creating major failures on a regular basis. Better yet, perhaps your own team has run a GameDay and learned something new about their services’ behavior during failure scenarios.

Introduction to GameDay webinar

Learn all about Gremlin's GameDay feature in this webinar presented by Sydney Lesser and Andre Newman. GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them. Increase your system’s reliability with safe, secure, and simple GameDays.