Incident Management

The latest news and information on incident management, on-call, incident response and related technologies.

Event types and use cases for event correlation

As organizations grow and become more complex, so does the need to monitor and troubleshoot issues across the entire IT infrastructure. Event correlation is a powerful technique that can help make sense of the huge volume of alert data generated by monitoring systems and identify problems as they occur. In this blog, we’ll look at event types, use cases for event correlation and approaches that organizations can use to get the most out of this valuable tool.

How we do realtime response with incident.io, Sentry & PagerDuty

Like most tech companies, we use an on-call rota and various alerting tools. We do this to respond to incidents before they’re reported. Proactively identifying issues and communicating to customers helps us provide great experiences and fosters trust. Internally, we’ve been using these alerting tools in tandem with our auto-create incidents feature. We’ve found that it’s made responding to the pager much smoother - it’s one less thing to do when you get paged at 2am.

iLert is now a verified integration with HCP Consul

More than 16 months ago we provided a solution to integrate HashiCorp Consul with our alerting and on-call management platform by using consul-alerts - a dedicated application that allows for communication between a deployed Consul instance and an existing iLert account. With more code infrastructure being moved to the cloud to ensure better security and availability, we too have ensured that our service integrates with the HashiCorp Cloud Platform (HCP).

PagerTree 4.0 is finally here!

Today I am excited to announce we have officially shipped PagerTree 4.0! Here are the highlights: This effort has been a year and a half in development and I sincerely want to thank each and every one of our customers for the constructive feedback, ideas, and countless hours on Zoom calls. Without you this journey wouldn’t be possible. We are excited to get this major release shipped, just in time for the holidays. You can check out the full details of the upgrade below.

How Many SREs Does Your Company Need? Here's How to Decide

So you’ve decided to take advantage of Site Reliability Engineering by hiring SREs for your company. Now, you have a second decision to make: Exactly how many SREs to hire. Do you need just one or two SREs? Or should you build a sprawling SRE team, with a dozen or more SREs on hand to support your organization’s reliability needs? The answers to these questions will, of course, vary; every business’s needs are different.

Webinar: Making the case for AIOps

Over the past few years, artificial intelligence for IT Operations (AIOps) has risen in popularity within the technology landscape. It’s become a buzzword in the marketing world, and while there are many ways to define AIOps, the best way to start thinking about it is through the lens of outcomes, correlation and strategy—it’s all about the data.

Why you should ditch your overly detailed incident response plan

When critical incidents happen — which they inevitably do 😅 — and you’re in the middle of trying to figure out what the best thing to do is, it can feel comforting to know that you’ve got a pre-prepared list of instructions to follow, commonly known as an “incident response plan”: In theory this sounds quite simple, and a typical flow you might envision is: It might be tempting to think that the hardest part of running incidents is finding or writing a checkl