
Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

ChaosSearch Announces New Integration With Opsgenie

ChaosSearch is excited to announce its new integration with Opsgenie — Atlassian’s alerting and incident management platform. Using this integration, your teams can leverage the industry’s most powerful and comprehensive data monitoring and analytics capabilities channeled into a unified workflow through Opsgenie’s easy-to-use interface.

Incident Management with Datadog

When your application experiences an outage, the tools your team uses to manage its response can make all the difference in how quickly they resolve the problem and avoid it in the future. An effective incident management workflow depends on accessible, integrated tools as well as clear, direct channels of communication. And, even after the matter’s been resolved, documentation and analysis of an outage are vital to ensuring it never happens again.

Performing Zabbix Alert Correlation and Incident Acceleration with CloudFabrix AIOps

The CloudFabrix AIOps 360 solution can ingest alerts, events, and metrics from various monitoring tools to perform event correlation, reduce alert noise, and accelerate incident resolution. In this blog I will cover how Zabbix integrates with our AIOps 360 solution. Zabbix is a popular open source monitoring platform used by many enterprises and MSPs, including some of our customers.

Attaching incident playbooks to Azure Monitor alerts for rapid remediation

Incident response playbooks are sets of actions that your incident responders need to execute, depending on the nature of the outage. Having well-defined incident response playbooks can be critical, especially during high customer impact events that you would typically classify as Sev-0 incidents.

Make Informed Care Decisions With an EHR and Communication Tool Integration

Electronic health records (EHRs) are real-time patient health record systems made to securely share patient information with authorized users. Users include those in medical labs, imaging facilities, pharmacies and emergency departments. Essentially, EHRs provide medical information to everyone involved in the patient-care continuum. OnPage continuously explores new ways to expand its value and enhance business processes and workflows for clients.

The Importance of Reliability Engineering

If you’ve spent any time in tech circles lately, there are three letters you’ve surely heard: SRE. Site Reliability Engineering is the defining movement in tech today. Giants like Google and Amazon market their ability to provide reliable service, and startups are now investing in reliability as an early priority. But what makes reliability engineering so important?

AIOps Best Practices | First Data/Fiserv: Going Ticketless with AIOps and Moogsoft

At First Data/Fiserv, AIOps dramatically improved incident management and resolution, a transformation that allowed this financial services provider to go almost ticketless. The speakers describe the entire process, which started when the CIO called for a global, next-gen monitoring platform. First Data/Fiserv soon realized that Moogsoft’s collaboration and record-keeping capabilities allowed it to slash tickets by 95%. They also describe how the system was fine-tuned to handle both regular and critical incidents transparently.

How SLOs Enable Fast, Reliable Application Delivery

As enterprises adopt DevOps at scale, there is increasing tension between product, operations, and the business to manage competing incentives around release velocity and risk. In this webinar, you’ll learn how adopting a collaborative approach to implementing service level objectives (SLOs) gives software teams and leaders a shared language to focus engineering efforts and optimize the customer experience.

Improving Postmortems from Chores to Masterclass with Paul Osman

At our 2019 Blameless Summit, Paul Osman spoke about how to take postmortems or incident retrospectives to a new level. The following transcript has been lightly edited for clarity. Slides from this talk are available here. Paul Osman: I lead the SRE team at Under Armour. Who here knows about Under Armour as a tech company? Does anybody think about Under Armour as a tech company? Under Armour makes athletic attire, shirts and shoes.