Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

PagerDuty Bolsters Leadership Team with Appointments of Chief Information Security Officer and Senior Vice President of Engineering

PagerDuty, Inc. announces the appointments of Pritesh Parekh as Chief Information Security Officer (CISO) and Rukmini Reddy as Senior Vice President of Engineering. With these appointments, the company expands its senior leadership as it continues its commitment to innovating as the most trusted and resilient digital operations management platform for the enterprise.

incident.io is best in class for momentum, relationships and enterprise adoption

Trust doesn’t just happen overnight. For us at incident.io, it’s been a journey—one that’s focused on people just as much as the product. From the start, we knew that building great incident management software wasn’t just about creating features and functionality. It was about building relationships, understanding our users, and truly being there for them when it matters most. Our focus has always been to help teams manage incidents better.

Syncing PagerDuty Schedules to Slack Groups

We’ve posted before about how engineers on call at Honeycomb aren’t expected to do project work, and that whenever they’re not dealing with interruptions, they’re free to work on whatever will make the on-call experience better. However, all of our engineering rotations rely on hand-off meetings where they update the Slack groups with everyone who’s on call. During my last shift, a small problem kept causing friction for some of our incident management automation.

How Effective are Your Alerting Rules?

Recently, I came across a Reddit post highlighting the challenges of having ineffective alerting rules. Here at OnPage, we have worked with various companies that have dealt with just that, so I felt I should share some of our top tips for creating effective alerting rules in this blog. Read on to discover…

How to build automatic remediation workflows in Grafana Cloud

When incidents occur, engineers must jump into action to get systems back to running at peak performance. However, there are a myriad of challenges that can prevent them from resolving the issues swiftly. Imagine a scenario where a team of DevOps engineers manages a cloud-based e-commerce platform that experiences occasional spikes in traffic during peak shopping seasons. During one of those major sales events, the team notices a sharp spike in CPU usage across several critical application servers.

Demo Roundups! Automation Standardization (Runbook Automation)

Solution consultants Asif Ahmad and Justyn Roberts show how PagerDuty's management and orchestration for the enterprise helps organizations connect and automate work across teams, systems, and environments. Level up your digital operations expertise with PagerDuty Demo Roundups — a series of live, interactive webinars where you can deepen your knowledge in the Operations Cloud and see how PagerDuty can work for you.

Create Round Robin Rotation in Slack using App

Pagerly, a Slack app designed for shift scheduling, makes it easy to create round-robin rotations for various teams. Whether it's a support, engineering, sales, or any other team, Pagerly helps manage shift schedules and team rosters within your Slack workspace. The Pagerly app can be installed directly from the Slack App Directory, and it is a comprehensive rotation app designed to optimize scheduling in Slack.
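
For readers unfamiliar with the term, a round-robin rotation simply hands each shift to the next person in a fixed order and wraps around at the end of the list. The sketch below illustrates that idea in plain Python. It is not Pagerly's implementation and does not call any Slack or Pagerly API; the function name `round_robin_schedule`, the member names, and the seven-day shift length are all assumptions made for illustration.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def round_robin_schedule(members, start, shifts, shift_days=7):
    """Assign consecutive shifts to members in a repeating, fixed order.

    members    -- ordered list of people in the rotation (hypothetical names)
    start      -- date of the first shift
    shifts     -- how many shifts to generate
    shift_days -- length of each shift in days (assumed weekly by default)
    """
    if not members:
        raise ValueError("members must not be empty")
    schedule = []
    # cycle() repeats the member list forever; islice() takes the first `shifts` picks.
    for index, member in enumerate(islice(cycle(members), shifts)):
        shift_start = start + timedelta(days=index * shift_days)
        schedule.append((shift_start, member))
    return schedule


if __name__ == "__main__":
    for shift_start, member in round_robin_schedule(
        ["ana", "ben", "chi"], date(2025, 1, 6), 5
    ):
        print(shift_start.isoformat(), member)
```

With three members and five weekly shifts, the fourth shift wraps back to the first person in the list, which is the defining property of a round-robin rotation.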

Press Start to Scale: SRE in Gaming - Incidentally Reliable with Denys Pashutynski

In our latest episode, we speak with Denys Pashutynski, Senior Engineering Manager of Site Reliability at Roblox, about the formidable challenges of sustaining a global gaming platform. Drawing from his tenure at Twitter, AWS, and eBay, Denys delves into managing traffic surges, latency optimization, and strategic change management. Exclusively on The Incidentally Reliable podcast, which is made by SREs for SREs and hosted by Zenduty.
Sponsored Post

Financial Benefits of Incident Management: Cost Savings and ROI

Have you ever assessed the financial impact of an hour of downtime on your business? If not, the results might be more alarming than you expect. For large enterprises, the cost can easily reach millions, and that's only the beginning of the potential consequences.

How AI is Revolutionizing SaaS and Cloud Software: Key Trends for 2025

In recent years, artificial intelligence (AI) has ceased to be a mere technological trend and has established itself as a foundational element shaping the future of Software as a Service (SaaS) and cloud-based software solutions. By 2025, AI's integration into these domains will not just enhance existing functionalities but redefine what is possible in ways we’re only beginning to comprehend.