Incident Management

The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Actionable Insights - Faster Incident Resolution with Datadog and Moogsoft Observability Cloud

Context is king, they say, and anything you can do to improve context both makes decisions and assessments more reliable and speeds up the decision process. A new, bi-directional integration between Moogsoft Observability Cloud and Datadog does just that. Many SRE teams rely on Datadog to provide comprehensive information about their application stacks.

What is the Difference between SLAs and OLAs?

In traditional IT environments, services to customers are delivered and supported by the organization. A Service Level Agreement (SLA) is created with details such as what the availability of the service will be, how reliable the service will be, and what penalties can be charged in case of downtime. Internal teams such as the network administration team, development team, and IT service desk then draw up Operational Level Agreements (OLAs) to support the SLA.

"I'm Just Doing my Job," An SRE Myth

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

Stay Alert to Security With Xray and PagerDuty

When it comes to securing your software development against open source vulnerabilities, the earlier action occurs — by the right person — the safer you and your enterprise will be. Many IT departments rely on the PagerDuty incident response platform to improve visibility and agility across the organization.

Incident Communication Is a Key Part of Resolving Network Issues

You’ve just received a notification—a major network issue has occurred. Hoping it’s a false positive, you complete an initial triage. Dang it! It’s the real thing. If you’re like me, your mind likely turns to one thing: fixing the issue as fast as you can. But hold on! Before you turn completely to fixing it, there’s another important aspect to any incident that you can’t forget, and that’s incident communication.

Carrefour Bank Uses PagerDuty and Rundeck to Automatically Self-Heal Incidents

With the mission of transforming the customer experience for financial services, Carrefour Bank offers a wide portfolio of financial products created to meet and satisfy different customer needs. Learn how Carrefour Bank leverages PagerDuty and Rundeck to automatically self-heal incidents to keep customers happy and resolution times down.

PagerDuty's Ops Guides Get a Fresh New Look

The Community and Advocacy Team here at PagerDuty recently spruced up our library of ops guides, and we’re excited to share them with you. If you’re not familiar with the ops guides, they are an open-sourced collection of long-form documents that cover a variety of topics related to real-time operations and incident management. We’ve given them some spiffy new headers, cleaned up some sneaky errors, and added a new section titled “Next Steps.”

What the Big Brother Approach to IT Monitoring and Incident Management May Be Missing

We asked in a recent poll which popular TV show your IT team resembles the most. Big Brother came out on top, with almost 40% of respondents saying that their incident resolution process most resembled this show. Would you compare your incident management process to an episode of Big Brother? If so, it's likely that your IT environment is highly monitored, but incidents still seem to slip through the cracks.