Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Intelligent Swarming vs. Tiered Support: How Customer Service Teams can use PagerDuty to Swarm Critical Issues

Most support organizations today adopt some form of the traditional tiered support model. It is one that is based on a process of escalations and customer handoffs. Under this model, customer issues get escalated through multiple levels of a support hierarchy, with three tiers being a common workflow.

Learn how PagerDuty can help address critical work across all departments

PagerDuty’s Operations Cloud helps organizations with critical work across the entire business, from IT teams to customer service to human resources, marketing, sales, and more. With PagerDuty, organizations can prioritize accurately, respond efficiently, and reduce operational overhead. In this blog post, we’ll share examples of how PagerDuty can be used for critical work in all departments, not just IT, using our new Solution Guides for Business.

SRE and the Practice of Practice

Part of the trepidation of being on-call is encountering unfamiliar emergency scenarios where we are surprised by suddenly not knowing how to do our jobs. We feel lost and alone, complicated by the world around us, powerless to resolve or even mitigate the problem. On-call need not be a solo affair full of fear and anxiety. There are ways we can employ practice and open collaboration outside of incidents to prepare us better.

What the Ideal Incident Lifecycle Should Be

Today’s organizations are managing increasingly complex IT ecosystems and pressured to deliver on innovation—all while trying to maintain service performance and reliability to keep up with the always-on digital economy. With IT complexity growing exponentially, incidents have become a common, if not day-to-day struggle for many businesses. Incident management is the process or method that modern organizations use to prepare for and respond to service disruptions.

The Universal Language: Reliability for Non-Engineering Teams

We talk about reliability a lot from the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application is something that impacts the entire business with significant costs. A mindset of putting reliability first is a business imperative that all teams should share.

Building an SRE Team with Specialization

As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization.

The Human Side of Being On-call: 5 Lessons for Managing Stress, Anxiety, and Life While Being On-call

Within DevOps, we talk a lot about the on-call process—but what about the human side of being on-call? For example, what are effective ways of managing stress and anxiety during a shift? How can one manage life situations that make being on-call difficult—such as being responsible for watching the kids during an on-call rotation? And how can an empathic team culture help prevent burnout and turnover?