The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Incident Management and Response

In this video, discover how Cortex transforms incident management by automating key processes, reducing response times, and providing real-time visibility into your engineering ecosystem. With seamless integrations and AI-powered insights, Cortex helps teams go from reactive to proactive, improving reliability and accelerating recovery.

Managing Alerts: Car Alarms and Smoke Alarms

Building and shipping an application is exciting: you watch your idea come alive and reach users. But once it’s out there, your real job begins: keeping it alive. An app in production isn’t just running code; it’s a living system. It needs monitoring to stay healthy and alerting to warn you when something’s off. But there’s a catch: too few alerts, and you’ll miss real issues; too many, and you’ll drown in noise.

The one where we scaled

From 3 people in 2020 to 93 in 2025, incident.io has come a long way, and we’re just getting started. Whether you’ve been here since the early days or just joined, this is what it looks like to build something great *together*. If you're after great people, real impact (across the globe, not just in Greece), a place where growth is the default, and teammates who’ll always be there for you... We’re hiring! (And we're going to need a bigger couch…)

We Built an SRE Agent With Memory And It's Transforming Incident Response

If you feel like your incidents are multiplying while your stack gets more complex by the week, you’re not alone. Event volumes keep climbing, signals live in a dozen tools, and human responders are stretched thin. That’s exactly why we built the PagerDuty SRE Agent—a vendor‑agnostic AI teammate that improves with every response to make the next one faster, smarter, and more reliable.

Too Late to Learn: Why Security Post-Mortems Fail and How AI Can Help

An effective post-mortem can turn a security breach into a blueprint for lasting resilience. But too often, in the stress of an incident, documenting what happened takes a back seat to containment and recovery. The resulting analysis relies heavily on memory, scattered notes, and competing narratives. Valuable context gets lost, timelines blur, and lessons that could strengthen defenses never become institutional knowledge.

How agentic ITOps helps ensure resilient IT infrastructures

Infrastructure resilience is essential for any modern IT environment. Downtime is expensive. Beyond the stresses of day-to-day operations, you want to be confident that your IT systems will continue functioning during service disruptions, hardware failures, or natural disasters. Agentic ITOps can help ensure a reliable, resilient IT infrastructure environment. These systems use agentic AI to help IT teams minimize downtime, improve customer trust, and protect your business’s revenue and reputation.

Jira Service Management (JSM) Review for Alerting (2025)

Atlassian is shutting down OpsGenie. New sales stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. If you’re not sure JSM is the right fit for your team’s alerting needs, this review will help you decide. I signed up for JSM and put it through real-world testing.