
Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How to Improve Your Building Management System

A building management system (BMS) lets your business monitor and control mechanical and electrical equipment across one or more buildings. Heating, ventilation, and air conditioning (HVAC), security, and other systems linked to a BMS usually represent 70% of a building’s energy usage. Proper configuration of your BMS is therefore key: a poorly configured system can negatively impact your building’s efficiency, maintenance, security, and safety.

4 Tips on Preparing for a [Great] Failure

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

What are MTTR, MTBF, MTTF, and MTTA? A guide to Incident Management metrics

In today’s fast-moving digital world, it has become critical for businesses to measure and track their service delivery performance. Incident management metrics matter most here: they monitor system uptime, downtime due to outages, and how quickly and efficiently issues are resolved. Even a slight glitch in a system can disrupt business processes and cost millions of dollars.
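As a rough illustration of how these metrics are commonly computed, the sketch below derives MTTA, MTTR, and MTBF from a small list of incident records. The `Incident` fields, the sample timestamps, and the observation window are all made up for this example. It also assumes one common set of definitions: MTTA as the average time from start to acknowledgement, MTTR as the average time from start to resolution, and MTBF as total uptime divided by the number of failures. Teams define these terms differently, so treat it as a starting point rather than a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime       # when the outage began
    acknowledged: datetime  # when a responder acknowledged the alert
    resolved: datetime      # when service was restored


def mean(deltas):
    """Average a non-empty iterable of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    """Mean time to acknowledge: start -> acknowledgement."""
    return mean(i.acknowledged - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve: start -> resolution (one common reading of MTTR)."""
    return mean(i.resolved - i.started for i in incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window / number of failures."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


# Made-up sample data: two incidents inside a 24-hour observation window.
window_start = datetime(2021, 1, 1, 0, 0)
window_end = datetime(2021, 1, 2, 0, 0)
incidents = [
    Incident(datetime(2021, 1, 1, 2, 0),
             datetime(2021, 1, 1, 2, 5),
             datetime(2021, 1, 1, 3, 0)),
    Incident(datetime(2021, 1, 1, 10, 0),
             datetime(2021, 1, 1, 10, 15),
             datetime(2021, 1, 1, 10, 30)),
]

print("MTTA:", mtta(incidents))                            # 0:10:00
print("MTTR:", mttr(incidents))                            # 0:45:00
print("MTBF:", mtbf(incidents, window_start, window_end))  # 11:15:00
```

MTTF is left out on purpose: it is usually applied to components that are replaced rather than repaired, so it would need failure-lifetime data rather than incident timestamps.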

Using BigPanda and ServiceNow to prevent and resolve outages

BigPanda augments ServiceNow and helps IT Ops teams work more efficiently in modern IT stacks, reducing MTTR by 40% or more. By using BigPanda and ServiceNow together, IT Ops teams get real-time service mapping for dynamic infrastructures, can reduce and automate ServiceNow ticketing, and can surface the root-cause changes affecting their continuous delivery.

Customer Devotion: How We're Bringing OneDuty to Life

It’s been almost a year since the world changed overnight and industries everywhere quickly adapted to living, working, and learning entirely online. While the world seemed to stop in an instant, many businesses saw an increase in demand and new challenges. PagerDuty was no different.

Communication Tool Down? Here are 3 Ways to Handle it

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: failure is inevitable.

How to get a phone call when your API fails

Learn how you can get a phone call alert when your API fails. Spike.sh sends you alerts via phone call, SMS message, email and Slack when you have any issues in production. Spike.sh integrates with your infrastructure, performance monitoring, error tracking, uptime monitoring, API monitoring and cron job monitoring tools. Our integrations include AWS, Google Cloud, Datadog, Grafana, Prometheus, New Relic and many more.

How to get a phone call when your cron job fails

Learn how you can get a phone call alert when your cron job fails. Spike.sh sends you alerts via phone call, SMS message, email and Slack when you have any issues in production. Spike.sh integrates with your infrastructure, performance monitoring, error tracking, uptime monitoring, API monitoring and cron job monitoring tools. Our integrations include AWS, Google Cloud, Datadog, Grafana, Prometheus, New Relic and many more.