Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

3 Ways SRE Can Boost your Business Value

Adopting SRE principles into your organization can be a big undertaking. You’ll need to develop new practices and procedures to minimize the costs of incident coordination. You’ll need to create a retrospective process that encourages continuous learning. You’ll need to shift culture to begin appreciating failure as an opportunity to grow. Your transition to the world of SRE will also require buy-in from all levels of your organization.

VictorOps and Relay for Incident Response

VictorOps is an incident response tool whose mission is straightforward: “To make being on call suck less.” It enables teams to quickly detect and respond to problems such as a service degradation or outage. VictorOps supports a wide range of external integrations that extend its capabilities by connecting different parts of your DevOps toolchain.

Incident Ready: How to Chaos Engineer Your Incident Response Process

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time? In this video, FireHydrant CEO, Robert Ross, shared how our customers leverage best practices to break, mitigate, resolve, and fireproof incident processes.

Incident Ready: How to Chaos Engineer Your Incident Response Process | FireHydrant

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time? In this video, FireHydrant CEO, Robert Ross, will share how FireHydrant customers leverage best practices to break, mitigate, resolve, and fireproof incident processes. We’ll show you how to use chaos engineering philosophies to stress test 3 critical parts of a great process.

Microsoft's 3 major incidents in 10 days, where did they go wrong?

Just in case you haven’t heard, last week Microsoft experienced a huge outage that prevented users from accessing its Office 365 cloud-based subscription service, which serves 200 million active monthly users. This latest outage was the third in ten days, and the company received a deluge of complaints from customers about a 'something went wrong' message that popped up when they tried to access their accounts.

October 2020 Update: Mute overwrite for iPhone (Critical Alerts), undo and more

Our October update brings the long-awaited mute-overwrite on iPhone (‘critical alerts’). We also introduce an undo action for Signl acknowledgements or closures. And in the web app you can now batch-acknowledge and close multiple Signls at once. All new features are introduced below – enjoy.

PagerDuty Summit: Lacework on the Shared Irresponsibility Model of Cloud Security

Cloud security has become increasingly complex of late. Cloud providers use tens of thousands of APIs, container orchestration systems are growing in number and complexity, and more platforms and services are entering the cloud-native ring. What’s more, each of these components poses a potential security risk to organizations. And it’s you, as the customer, who is responsible for the configuration and security of those components.