Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Equitably distribute on-call responsibility and streamline incident response with Round Robin Scheduling

PagerDuty is excited to introduce Round Robin Scheduling. Round Robin Scheduling allows teams to equitably distribute on-call shift responsibilities amongst team members. Automatically assigning new incidents across different users or on-call schedules on an escalation level ensures that teams are resolving incidents as efficiently as possible. And, by balancing the workload across multiple users, there’s less risk of burnout.
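The rotation idea described above can be illustrated with a minimal sketch. This is not PagerDuty's API or implementation; the function name `round_robin_assigner`, the responder names, and the incident IDs are all hypothetical, chosen only to show how cycling through a list spreads new incidents evenly.

```python
from collections import Counter
from itertools import cycle


def round_robin_assigner(responders):
    """Return a function that assigns each new incident to the next responder.

    Hypothetical helper for illustration only; not part of any PagerDuty library.
    """
    if not responders:
        raise ValueError("at least one responder is required")
    rotation = cycle(responders)  # endlessly repeats the responders in order

    def assign(incident_id):
        # Pair the incident with whoever is next in the rotation.
        return incident_id, next(rotation)

    return assign


# Assumed example data: three responders on one escalation level, six incidents.
assign = round_robin_assigner(["alice", "bob", "carol"])
assignments = [assign(f"INC-{n}") for n in range(1, 7)]

# Count how many incidents each responder received.
load = Counter(user for _, user in assignments)
```

With six incidents and three responders, each person in this sketch receives two incidents, which is the balancing effect the announcement describes.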

What exactly is Digital Operations?

IT modernization (for example, cloud computing), digital optimization, and the creation of new digital business models are all examples of digital transformation. The concept of combining company processes with agility, intelligence, and automation to build operational models that delight consumers while also improving performance is known as digital operations.

Intelligent Swarming vs. Tiered Support: How Customer Service Teams can use PagerDuty to Swarm Critical Issues

Most support organizations today adopt some form of the traditional tiered support model, which is based on a process of escalations and customer handoffs. Under this model, customer issues get escalated through multiple levels of a support hierarchy, with three tiers being a common workflow.

Learn how PagerDuty can help address critical work across all departments

PagerDuty’s Operations Cloud helps organizations with critical work across the entire business, from IT teams to customer service to human resources, marketing, sales, and more. With PagerDuty, organizations can prioritize accurately, respond efficiently, and reduce operational overhead. In this blog post, we’ll share examples of how PagerDuty can be used for critical work in all departments, not just IT, using our new Solution Guides for Business.

SRE and the Practice of Practice

Part of the trepidation of being on-call is encountering unfamiliar emergency scenarios where we are surprised by suddenly not knowing how to do our jobs. We feel lost and alone, complicated by the world around us, powerless to resolve or even mitigate the problem. On-call need not be a solo affair full of fear and anxiety. There are ways we can employ practice and open collaboration outside of incidents to prepare us better.

What the Ideal Incident Lifecycle Should Be

Today’s organizations are managing increasingly complex IT ecosystems and are pressured to deliver on innovation, all while trying to maintain service performance and reliability to keep up with the always-on digital economy. With IT complexity growing exponentially, incidents have become a common, if not day-to-day, struggle for many businesses. Incident management is the process or method that modern organizations use to prepare for and respond to service disruptions.

The Universal Language: Reliability for Non-Engineering Teams

We talk about reliability a lot in the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application affects the entire business and carries significant costs. A mindset of putting reliability first is a business imperative that all teams should share.

Building an SRE Team with Specialization

As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization.