Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Incident Response: A Step-by-Step Guide to Managing Incidents

Looking into Incident Response? We explain incident response, the end-to-end process, the teams involved, and steps to take to avoid friction and slow-down. The goal is to manage the incident as efficiently as possible in order to restore or resume the service to its expected operational state.

The Cost of Increasing Incidents: How COVID-19 Affected MTTR, MTTA, and More

Digital transformation accelerated for many companies during the last 18 months. While it may have been on the agenda prior to COVID-19, teams were pushed to extreme speeds to digitize and meet the rising online demand. During this time, organizations learned important lessons that they’ll carry on with them into this new future. Leaders can take these learnings and use them to build better products, healthier and more efficient teams, and a happier customer base.

PagerDuty Integration Spotlight: InfluxData

InfluxData is an Open Source Platform built for metrics and events — a platform that is purpose-built for time series data. The essential time series toolkit — dashboards, queries, tasks and agents all in one place. InfluxDB is even more programmable and performant with a common API across OSS, cloud and enterprise editions. Send events to PagerDuty to keep your teams informed. Check out InfluxData’s integration.

Facebook, Instagram, and Whatsapp's Outage - Understanding MTTR

Yesterday the most used social media platforms in the world were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has an intense incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is, outages and performance issues are bound to happen in any network.

PagerDuty Integration Spotlight: HashiCorp Terraform

Manage your PagerDuty account objects with Terraform! Reap all the benefits of infrastructure as code and give your teams the flexibility they need to manage their services in real time. As infrastructure stacks grow increasingly more complex and involve an ever-growing number of services and systems, teams have looked to abstract configuration to its own layer of code. This concept of configuring infrastructure as code is gaining traction throughout the industry for a variety of reasons.

The Aftermath of the Facebook 6-Hour Outage

Less than 24 hours ago, the world came to a “social standstill” as Facebook, and its sister companies, WhatsApp and Instagram, became unavailable, leaving its 3.5 billion users in a flap. The outage, which lasted almost 6 hours, shut off access for users and businesses all over the world and caused ripple effects that we will likely continue to see in the immediate (and perhaps not-so-immediate) future.