
Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

What do site reliability engineers do?

Are you considering adopting SRE? We will explain the roles and responsibilities of an SRE team within your organization, and how to start building one. So what does an SRE team do? An SRE team is responsible for building software that improves the resiliency of systems, implementing fixes, responding to incidents, and automating processes whenever possible. Site reliability engineering is a holistic practice that incorporates various types of work.

Blameless Runbook Documentation is Now Generally Available!

At Blameless, our mission is to provide teams with the tools they need to operationalize SRE and embrace a culture of resilience. We help teams automate toil and adopt best practices across integrated incident management, comprehensive retrospectives, service level objectives, reliability insights, and more. We are very excited to announce that Blameless Runbook Documentation is now generally available for all customers.

ITSM Buyers' Guide: 7 Use Cases to Define Your ITSM Goals

Upgrading or switching to a new ITSM tool is full of obstacles for IT directors. From addressing fears about the cost of switching vendors to assessing service management maturity, building a case for why and how an ITSM tool can advance the business can be daunting. Thankfully, Info-Tech pulled together this selection guide.

Single Sign-On Now Available on OnPage Enterprise-Level Accounts

Single sign-on (SSO) services provide a unified view into applications, logins, and devices through a secure identity cloud. SSO allows users to access SaaS-based applications through one simple login process. We at OnPage are excited to announce that we've extended our integration catalog to include SSO services like Okta and OneLogin. Through a single sign-on process, OnPage enterprise-level users can access the OnPage dashboard from their Okta and OneLogin accounts.

New Integration: Declare FireHydrant Incidents from Checkly Alerts

Streamlining your incident management process is what we do best, and one of the ways we do that is by acting as the connective tissue across all of your applications. We’ve partnered with Checkly to bring you a new integration that empowers you to detect problems and resolve incidents faster.

Use Datadog's Notebooks API to programmatically manage your notebooks

Datadog Notebooks simplify the way teams across an organization find and share knowledge. By bringing together live data and rich Markdown text, Notebooks help teams create powerful, data-driven documents, from runbooks and support playbooks to incident postmortems and data reports. And with collaboration features like real-time editing and commenting, team members can simultaneously make changes to a document and gather feedback along the way.

Resilience in Action Episode 7: Killing Ops with Tony Hansmann

Resilience in Action is a podcast about all things resilience, from SRE and software engineering to how resilience affects our personal lives. It is hosted by Kurt Andersen, a practitioner and an active thought leader in the SRE community. He speaks at major DevOps and SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.