Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Top 6 Reasons Why You Need a Status Page Aggregator

Your business depends on the reliability of the third-party services you use. Monitoring the status pages of these services is the best way of keeping track of their outages and maintenances. Although some status pages let you subscribe to alerts, there is no standard way of doing this. Service providers can change their status page providers, disable subscriptions, or not support the same notification options.

Feature Spotlight - Incident Automations

From managing issues and resources to keeping customers updated, resolving an incident requires a level of multi-tasking that can be overwhelming for even the most efficient of teams. Automating your processes reduces the time needed to diagnose, mitigate, and resolve incidents, and simplifies communication throughout an incident's lifecycle.

Remediate Kubernetes incidents faster using private actions in your apps and workflows

The Datadog Action Catalog provides more than 1,400 actions to help you accelerate remediation across your infrastructure directly within Datadog. With actions, you can use Workflow Automation to configure workflows that automatically address issues as they happen and build custom apps in App Builder that empower anyone in your organization to act when incidents occur.

Incident Response Management: A Category of Its Own

In recent weeks, I’ve spoken with several Opsgenie customers who are evaluating a migration to ilert after Atlassian’s decision to phase out Opsgenie and fold its functionality into other products. Atlassian is giving Opsgenie users “two options: move to Jira Service Management for robust end-to-end incident management, or move to Compass for alerting and on-call management.” This has raised a broader question in our industry: ‍

From Tickets to Action: Ensuring Proactive IT Support with Jira and OnPage

We’re excited to announce the launch of our bi-directional integration between OnPage and Jira! This integration is designed to bridge the gap between ticket creation and incident response, ensuring that IT, DevOps and other tech teams who rely on Jira to manage their incidents can automatically identify and engage the right on-call staff—ensuring critical incidents are addressed in real time without delay.

OpsGenie End of Life? What's next for OpsGenie users.

If you haven’t heard already (which would be shocking considering the numerous posts I’ve seen on Reddit) Opsgenie’s end of life is right around the corner. This means there is no better time for Opsgenie users to explore alerting and on-call management tools outside of the limited alternatives provided by Atlassian. So, I felt now is a better time than any to address the needs of those affected by the dissolution of Opsgenie and reveal why OnPage should be your new platform of choice.
Sponsored Post

Incident Response Process: Stages, Framework & Best Practices

These days, organizations must be prepared to handle unexpected disruptions efficiently. Whether it's a cybersecurity breach, system failure, or a natural disaster, having a structured Incident Management Process is essential. The Incident Management Team plays a crucial role in swiftly identifying, assessing, and resolving incidents, minimizing downtime, and ensuring business continuity. This blog explores the stages, framework, and best practices of incident management to help businesses build a robust response system.

An ultimate step-by-step guide on Zabbix Cloud Monitoring

‍ Learn how to set up Zabbix Cloud for AWS Auto-Discovery and receive critical alerts via SMS, phone calls, or push notifications. ‍ During the last Zabbix Summit, the company presented a cloud version of its well-known monitoring platform. We at ilert constantly see the growing popularity of Zabbix as more and more teams across the globe utilize it for their monitoring needs. To help users quickly adopt the new cloud version, we delivered this guide.

How we structure on-call rotations at Datadog

A well-structured on-call rotation helps you ensure the reliability of your services and meet your customers’ expectations by designating staff to respond to emerging issues. But the pressures of on-call work—such as long shifts, overnight hours, and dynamic situations—can compromise the well-being of your team members. This makes it harder for them to maximize service uptime during their on-call shifts and can limit the velocity of the feature work they do outside of their on-call duty.