Operations | Monitoring | ITSM | DevOps | Cloud

Incident Response

Introducing Grafana OnCall OSS, on-call management for the open source community

Last November, we announced the launch of Grafana OnCall, an easy-to-use on-call management tool that helps reduce toil through simpler workflows and interfaces tailored for developers. Born out of Grafana Labs' acquisition of Amixr Inc., Grafana OnCall began as a cloud-only solution that became generally available to all Grafana Cloud users, on both paid and free plans, in February.

Crossing "The Last Mile" with an Incident Response System

Delivering dependable and high-performing IT services in 2022 requires coordination and collaboration across different workflows, areas of expertise, and even time zones. Whether serving in-house colleagues or external clients, there is immense pressure on IT management to create seamless experiences 24/7/365. Seconds matter when critical systems break down, and slow incident resolution can have costly ramifications on customer experience and employee productivity.

The Future of Incident Response is Automated, Flexible, and Proactive

We know our customers rely on PagerDuty as the backbone of critical real-time operations, so we want to make sure each and every enhancement helps streamline incident response. How can we help our customers spend less time firefighting and more time innovating? One of PagerDuty’s values is Champion the Customer – and we take this very seriously. When building and improving features, we aim to keep a pulse on what’s going on with our customers: what’s keeping them up at night?

What's New: Updates to Incident Response, AIOps, Pagerduty Process Automation, and More!

Summit’s right around the corner (have you registered yet?) but the shipping doesn’t stop! We’re excited to announce a new set of updates and enhancements to PagerDuty’s Digital Operations Platform. Recent updates from the product team include On-Call Management, Incident Response, Process Automation, and Integrations, to PagerDuty Community & Advocacy Events. New capabilities enable users and customers to resolve incidents faster, do the following, and more.

When incident response requires business response, who should you notify?

From a single on-call engineer hopping online to resolve a problem, to a massive cross-team effort that brings in even the most senior technical leadership (CTO, CISO, or CIO), incident response teams are lucky when they’re able to resolve issues before a customer is aware. But in the cases where there is customer impact, other stakeholders like sales and customer service need to be informed and updated as well.

How to empower your team to own incident response

Responding to and managing incidents feels fairly straightforward when you’re in a small team. As your team grows, it becomes harder to figure out the ownership of your services, especially during critical times. In those moments, you need everyone to know exactly what their role is in order to recover fast. Moving to incident.io as the 7th engineer, from a scaleup of around 70 engineers, has given me a new perspective on what it means to own your code.

Incident response: leadership & psychological safety with Jeli founder Nora Jones

Incident response can tricky. How do you create a culture prepared to handle it in a healthy and efficient way? What should and shouldn't leaders be involved in? How do you create a psychologically safe environment? Find out in this episode as Rob as he interviews Jeli founder Nora Jones on the do's and don'ts of incident response.

How to Make Your Incident Response Plan with Mattermost

For teams who deploy software to users around the world, every second counts when responding to outages and other incidents. It’s important that you have tools in your arsenal that are up to the challenge. Service monitoring, alerting, collaboration, and visibility are all essential components of a well-implemented incident response plan.

5 Ways Automated Incident Response Reduces Toil

Toil — endless, exhausting work that yields little value in DevOps and site reliability engineering (SRE) — is the scourge of security engineers everywhere. You end up with mountains of toil if you rely on manual effort to maintain cloud security. Your engineers spend a lot of time doing mundane jobs that don’t actually move the needle. Toil is detrimental to team morale because most technicians will become bored if they spend their days repeatedly solving the same problems.

Kubernetes Incident Response Best Practices

Inevitably, organizations that use technology (regardless of the extent) will have something, somewhere, go wrong. The key to a successful organization is to have the tools and processes in place to handle these incidents and get systems restored in a repeatable and reliable way in as little time as possible.