
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

AI-Assisted Incident Management Communication

AI has revolutionized various aspects of incident response, from preparation to resolution. Across the incident response lifecycle, AI is being leveraged to streamline processes, reduce noise, and improve overall efficiency. One critical area where AI is making a significant impact is in incident communication. Effective and efficient communication is crucial during incidents, as it ensures that stakeholders are informed and aligned with the incident status and resolution efforts.

Crisis Management for Oil and Gas Companies

Oil and gas companies operate in a high-stakes environment where the potential for catastrophic incidents, such as oil spills, explosions, and natural disasters, always exists. These risks call for robust crisis management to ensure the safety of personnel and minimize potential damage to operations and organizational reputation.

xMatters Workflow Overview - 2024

Everbridge xMatters automates workflows to eliminate business-impacting digital events, leveraging analytics, automation, and AI to improve response time and resolution. I will be walking through key features in xMatters that will keep your digital businesses running, reducing the frequency, duration, and associated cost of critical service disruptions.

A guide to Grafana OnCall SMS and call routing

Many organizations use incident response setups that enable them to page on-call personnel via calling or sending a message to a phone number. In this guide, you will learn how to configure such a system by using Grafana OnCall. For practical purposes, we’ll pair it with Twilio, though the same basic workflow should be applicable to other platforms. We will start with a basic setup that uses a phone number in Twilio to both call and send SMS messages to a webhook integration in Grafana OnCall.

Pagerly now available on Microsoft Teams - Manage Oncalls, Tickets and Incidents on MS Teams

Manage on-call and incidents in Microsoft Teams (integrates with PagerDuty and Opsgenie). Get on-call change notifications within Microsoft Teams. Mention the current on-call automatically in any conversation without switching applications.

What is Mean Time to Repair (MTTR)?

Mean time to repair (MTTR) is a metric used to measure the average time required to diagnose and fix a malfunctioning system or component, ensuring it returns to full operational status. In software development, downtime halts user access and disrupts operations, leading to customer dissatisfaction and financial losses. In manufacturing, it slows production, affecting supply chains and profitability. In healthcare, downtime can compromise patient care and safety.
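As a rough illustration of the metric, MTTR is commonly computed as the total repair time divided by the number of repairs. The sketch below assumes each incident is recorded as a (detected, resolved) timestamp pair. The function name `mean_time_to_repair` and the sample incidents are hypothetical and are not taken from the article.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Return the average of (resolved - detected) across all incidents.

    `incidents` is a list of (detected, resolved) datetime pairs.
    This record shape is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical sample data: repairs of 30, 90, and 60 minutes.
incidents = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 30)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 15, 30)),
    (datetime(2024, 5, 7, 22, 15), datetime(2024, 5, 7, 23, 15)),
]

print(mean_time_to_repair(incidents))  # 1:00:00
```

Teams differ on when the clock starts (failure, detection, or acknowledgement), so the choice of start timestamp should be stated alongside any reported figure.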

Our simple incident post-mortem template

Clean, clear, and ready to be customized to suit your needs. Having a dedicated incident post-mortem is just as important as having a robust incident response plan. The post-mortem is key to understanding exactly what went wrong, why it happened in the first place, and what you can do to avoid it in the future.

Automation in MSPs: Streamlining Service Delivery and Boosting Profitability

In today’s complex IT environment, clients demand quick, reliable services. To accomplish this, businesses have begun leveraging automation solutions to reduce response times and increase reliability, enabling staff to focus on strategic initiatives that drive business growth. However, many MSPs struggle to build an effective automation strategy and need help, making it challenging to remain competitive in the modern marketplace.

Scaling into the unknown: growing your company when there's no clear roadmap ahead

During a recent episode of The Debrief, we spoke with Jeff Forde, Architect on the Platform Engineering team at Collectors, about building an incident management program at various stages of growth. In that episode, we called it growth from zero to one, one to two, and two to three. But what happens once you’ve scaled beyond three, and answers to the questions you have become that much harder to find?
