
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Incident Management and Status Pages for Enterprise IT Departments

The Incident Management and Status Page solution that lets you organize your enterprise IT team and communicate with users for a coordinated response that restores services rapidly. StatusCast works as an Incident Management platform to increase employee productivity inside organizations. There’s a lot you can do with StatusCast status pages to create the brand look you are seeking.

How to detect anomalies in logs, metrics, and traces to reduce MTTR with Elastic Machine Learning

Elastic Observability has extensive machine learning capabilities that support and improve analysis in APM. Learn techniques for detecting and correlating anomalies in telemetry data from APM agents for a particular application.

Managing a Slew of Monitoring Tools? Here's How to Make Them Talk.

Engineering teams use a lot of single-domain monitoring tools. In fact, the average team manages and maintains 16 monitoring tools — and up to 40 — according to Moogsoft’s State of Availability Report. While IT leaders select and implement these tools to save teams time, our research finds they do quite the opposite. Engineers spend far and away more time on monitoring than they do on any other task — innovative, value-creating tasks included.

The Importance of Role-Based Messaging in Healthcare

Do you remember the classic board game where you go back and forth with your opponent, deducing which character on the board each of you has selected? It’s still played by children today and, unfortunately, by healthcare teams as well. Every day, healthcare teams without role-based messaging systems are forced to play a game of “Guess Who?” to find out who is on call.

Expanding Incident Response with Microsoft Teams

Last week we launched a number of features across the PagerDuty Operations Cloud portfolio to help teams minimize downtime and protect customer experience. One of the areas where PagerDuty continues to invest is collaboration and communication during incident response, to ensure that all impacted stakeholders across the business are updated in real time.

Early stage data teams: a balancing act

Most well-established data teams have a clear remit and a well-defined structure for what they work on and when: from the scope of their role (from engineer to analyst) to which part of the business they work with. At incident.io, we have a two-person data team (soon to be three), and both of us are Product Analysts.

Empower the SREs - Conclusions from The SRE Report 2023

Let's be honest, nobody loves surveys. Ok, well I sure don't. But surveys satisfy a huge need in our demand for insights into complex human-computer, sociotechnical systems. It turns out that we've been measuring the computer part pretty well, but the humans – not as easy to keep track of. When Google SRE first defined toil as a metric we wanted to reduce, we spent far too long trying to quantify it numerically based on tooling and insights from computer systems.

Building an incident management process - incident.fm

In this podcast, our panellists discuss the foundations that any team needs to put in place when designing their incident management process. Starting from the basics of defining what we really mean by an incident, to how to set your severity levels, roles and statuses, Chris and Pete share their tips for building solid foundations to run your incidents.
