Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Reliability is not an engineering metric

If you're an engineer reading this, you might be wondering what I mean by the title. You might be a Site Reliability Engineer whose primary responsibility is to maintain the reliability of your company’s product/solution. You might be a software builder, a programmer responsible for building new capabilities and shipping them to production. All of these are important for any business to remain competitive.

A Developer's Perspective: Lessons from Open Source with FireHydrant and Backstage

We’re proud to announce that our front end FireHydrant plug in has been open-sourced as part of Backstage, an open platform for infrastructure tooling, services, and documentation created at Spotify. We introduce FireHydrant’s incident management and analytics in Backstage, where you can quickly and efficiently manage your incidents.

Getting Started with Site Reliability Engineering

Site Reliability Engineer (SRE) is one of the fastest growing jobs in tech, with Linkedin reporting 34% growth YoY in 2020 and over 9000 openings in their Emerging Jobs Report. If you’re new to SRE and exploring it as a career path, understand that it can be a challenging but rewarding experience. Here are some quick tips on how you can get started with SRE and jump-start a rewarding career.

We've raised a $23M Series B to help us get to a world where all software is reliable

At FireHydrant, we envision a world where all software is reliable, and we’re on a mission to help every company that builds or operates software get closer to 100% reliability. Today, we’re thrilled to announce that we’ve raised $23 million to help us further our goal.

Pragmatic Incident Response: 3 Lessons Learned from Failures

In my past experience as an SRE I’ve learned some valuable lessons about how to respond and learn from incidents. Declare and run retros for the small incidents. It's less stressful, and action items become much more actionable. Decrease the time it takes to analyze an incident. You'll remember more, and will learn more from the incident. Alert on pain felt by people — not computers. The only reason we declare incidents at all is because of the people on the other side of them.

The MTTR that matters

“Mean time to X” is a common term used to describe how long, on average, a particular milestone takes to achieve in incident response. There’s mean time to detect, acknowledge, mitigate, etc. And then there’s the elusive “mean time to recover,” also known as “MTTR.” MTTR, a hotly debated acronym and concept, measures how long it takes to resolve an incident on average. The problem with MTTR, though, is that it doesn’t matter.

FireHydrant May 2021 Product Updates: The summer of integrations

With 50% of the US adult population vaccinated, there’s a lot to look forward to this summer, life no longer feels like it’s on hold, and we’re fully embracing that. Get your fire hoses ready, 'cause extinguishing incidents just got easier. We’re rolling out a summer full of new integrations, product releases, events, and more.