Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Sponsored Post

What is MTTR? Resolve incidents faster through ops, alerting and documentation

When downtime strikes any distributed software deployment or platform, it's all hands on deck until the lights are green and service is restored. This process, from the recognition of a problem to a deployed solution, has most commonly been defined as MTTR - mean time to resolution. In just the last few years, DevOps and site reliability (SRE) professionals have developed sophisticated new models for how they work and audit their successes. In 2022, MTTR is one of the most widely-used software performance success metrics.

The startup guide to sensible incident management

If you’re working at an early stage startup and looking to get some good incident management foundations in place without investing excessive time and effort, this guide is quite literally for you. There’s an enormous amount of content available for organisations looking to import ‘gold standard’ incident management best practices – things like the PagerDuty Response site, the Atlassian incident management best practices, and the Google SRE book.

Now You can Invoke PagerDuty Rundeck Actions Within the PagerDuty Slack Integration

Last year, we released PagerDuty Rundeck Actions, a PagerDuty add-on product that connects responders to automated diagnostics and remediation for common problems directly in the PagerDuty incident response workflow. After working with our customers and listening to the community, we are excited to announce that PagerDuty Rundeck Actions now integrates with PagerDuty’s Slack integration.

Announcing Grafana Incident, smart incident management for your teams

A huge challenge when dealing with incidents is the coordination and communication needed to put things right. What’s happened so far? Who has tried what query? Did we remember to keep stakeholders informed? What is the severity of the incident? Does this affect customers? Figuring this out requires a lot of back and forth as new team members join the incident.

Grafana Incident: First look at the smart incident management tool

Announcing Grafana Incident, the smart incident management tool for your teams. Grafana Incident allows teams to start collaborating immediately by automatically setting up all the essential spaces and resources needed for incident response, from Zoom meetings and Slack channels to a tracker for important tasks and TODO items. A chatbot offers a command-line interface for managing incidents, and provides the ability to instantly embed Grafana queries, dashboards, and metadata, GitHub issues and pull requests, and more. Grafana Incident is available in preview for Grafana Cloud users.

Grafana OnCall is now generally available on Grafana Cloud, with a generous free tier

Today we’re announcing the general availability of Grafana OnCall on Grafana Cloud for all paid and free plans. A big part of delivering great software is ensuring the right people get the right information when the inevitable incidents occur. We want to help you do that with Grafana OnCall, an easy-to-use, developer-first on-call management tool that’s built on top of the Grafana stack you know and love.

Top tips to make Round Robin Scheduling successful for your team

You may have heard of Round Robin Scheduling before and thought to yourself, is this right for my team? Understanding how Round Robin Scheduling can be used and what teams it works best for is important when considering this method of on-call. Additionally, it comes with some pitfalls you’ll want to avoid, as well as best practices to adopt. In this blog post, we’ll share everything you need to know about Round Robin Scheduling within PagerDuty and how to get started.