
Gremlin

More Reliability, Less Firefighting: How to Build a Proactive Reliability Program

Does it feel like your team spends all its time putting out incident fires? Change the story with a proactive reliability program that actively improves reliability. Join reliability expert and engineering leader Jeff Nickoloff for a webinar that lays out the common traits of successful reliability programs so you can build more reliability and spend less time firefighting. You’ll also get a downloadable checklist worksheet to help you create and evaluate your reliability program.

How Detected Risks helps you find reliability risks in minutes, without running any tests

This video showcases Gremlin's Detected Risks feature. Detected Risks are high-priority reliability concerns that Gremlin automatically identifies in an environment. These include misconfigurations, bad default values, and reliability anti-patterns. Gremlin prioritizes these risks based on severity and impact, giving instantaneous feedback on risks and action items to improve the reliability and stability of each service.

Four Pillars of a Best-in-Class Reliability Program

Reliability impacts every organization, whether you plan for it or not. Leading companies take matters into their own hands and get ahead of incidents by building reliability programs. But since many of these programs are still nascent, how do you know what good looks like? Of course, the right tools and technology play an important role by enabling your team to uncover reliability risks before they impact users. But improving reliability goes beyond technology.

Announcing the Gremlin Enterprise Chaos Engineering Certification (GECEC) program

We knew Chaos Engineering was in high demand when we first launched the Gremlin certifications in 2021. But we had no idea our Chaos Engineering certification programs would be such a success. There’s a reason: the market is looking for professionals who know how to wield Chaos Engineering well, and Gremlin's certification has become the gold standard for learning the principles of Chaos Engineering and demonstrating proficiency.

Reliability Best Practices: How Gremlin Uses Gremlin

Ensuring software availability is essential for any SaaS company—including Gremlin. To do that, our teams need to identify the reliability risks hiding in our systems. That’s why our development, platform, and SRE teams use Gremlin regularly to perform Chaos Engineering experiments, run reliability tests, and track the reliability of our systems against our standards. Along the way they’ve picked up a thing or two about how to find and fix reliability risks with Gremlin.

How to Show Reliability Results to Your Organization

Building momentum for a reliability program can be tough. Improving reliability takes time, effort, and resources. But when everything from launching new features to improving security demands those same resources, it can be a struggle to get the buy-in you need to address reliability risks. And it makes sense! If a team spends time patching a known security bug or creating a new feature, they have a clear demonstration of the value created.

Don't Just React to Incidents: Prevent Them

Incident response has been the cornerstone of reliability for decades. From digging in the server logs to navigating modern observability dashboards, responding quickly to incidents and outages is a big part of minimizing downtime. And it should be! When something breaks, your team should move as quickly as possible to address and repair the problem.

Chaos Engineering Tools: Myth vs Fact

With so many Chaos Engineering tools available, it’s no surprise that SRE and platform leaders are doing their homework when choosing a platform to help them build and scale their Chaos Engineering programs. But like anything else you can research on the internet, there’s a lot of noise and hype that you need to wade through. Gremlin works with Reliability Engineering teams at hundreds of companies with the most sensitive workloads—and has since 2016.