Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

If everyone is AIOps - which AIOps is right for you?

With so many IT vendors claiming they provide AIOps platforms, how do you understand the differences between them and decide which flavor of AIOps to choose for your organization? Join us in a CTO Perspective discussion with Elik Eizenberg, CTO and co-founder at BigPanda, to find the answer. Read the skinny for a brief summary, then either lean back and watch the interview, or if you prefer to continue reading, take a few minutes to read the transcript. Enjoy!

SRE vs. DevOps [Understanding Differences & Similarities]

Site Reliability Engineering (SRE) and DevOps share a goal of building a bridge between development and operations. We'll explore and compare both approaches. Wondering which is better for your company, SRE or DevOps? Neither SRE nor DevOps is “better,” exactly, since they’re similar yet different in a few key ways: SRE, or site reliability engineering, is a methodology developed by Google engineer Ben Treynor Sloss in 2003.

Make your Onboarding Experience Better with a Murder Mystery Game

Onboarding a new tool can be boring. Or stressful. Or both. When onboarding an incident response tool, it can be difficult to make sure that your team is getting the most from the experience. Do you opt for a run-of-the-mill meeting, or try to learn while in an incident? Neither option is ideal. That’s why Petal’s DevOps Engineer Michael Cole found a new way to get his team using Blameless for their incident response process.

SRE Availability Metrics

How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
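The chart itself is not reproduced here, but the arithmetic behind this kind of table can be sketched in a few lines. The following is a minimal illustration, not the article's own method: the function names, the 365-day year, and the sample targets are all assumptions chosen for the example.

```python
# Sketch: converting between uptime and availability.
# Assumptions (not from the article): a 365-day year and the helper names below.

SECONDS_PER_DAY = 24 * 60 * 60


def availability_percent(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability as the share of total time the service was up, in percent."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_seconds / total


def allowed_downtime_seconds(target_percent: float, period_days: float = 365.0) -> float:
    """Downtime budget, in seconds, that a given availability target leaves over a period."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    return (1.0 - target_percent / 100.0) * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common "nines" targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_seconds(target) / 3600
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Passing a different `period_days` (for example 30) gives the budget for a month instead of a year.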

A Day in the Life: Intelligent Observability at Work with our SRE, Dinesh

When I asked Charlie for permission to attend this year’s AICon (virtual, natch) I thought it would be a shoo-in; learning’s part of my OKRs after all. But he never makes things easy and his ‘yes’ came with a caveat that’s typical when dealing with him. This time, he claimed he didn’t have the budget for the ticket (a likely story!) and I’d have to find another way to get one.

WTF is Incident Management? Post-Panel Wrap-Up

That's a wrap! We hosted "WTF is Incident Management" on May 12, 2021. We invited four very knowledgeable panelists to discuss how they define incident management, what changes they'd make if they could start again from scratch, how to manage team stress after an incident, and other subjects. Our panelists were: host Matt Stratton (Staff Developer Advocate at Pulumi), Emily Ruppe (Incident Commander at Twilio), Alina Anderson (Sr.

Enterprise Alert Alarm Center. A NOC's best friend.

Enterprise Alert keeps growing over time, and more and more teams are starting to benefit from its reliable alerting. In the process, it almost always becomes a central component of the NOC, and the NOC admins become practically trained on it along the way. For this reason, here in support we rarely have the pleasure of presenting the features of our alarm center.