Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Streamline Incident Response with Komodor and Squadcast

As Kubernetes grows in popularity as the container orchestration platform powering the microservices revolution, managing, monitoring, and responding to incidents at scale becomes more complex. Challenges in real production environments include gaining full visibility into the health of your clusters and environment, alongside real-time incident management and response.

Using DORA metrics Mean Lead Time for Changes to deliver iterations faster

Raise your hand if you like shipping changes quickly. (Yes, let's assume that everything you're shipping has value and isn't a vanity project.) Chances are, you, the person reading this now, agreed with the above. When you start on a project, big or small, you want to keep any changes moving along and avoid getting stuck. The less time between the beginning and end of a project, the faster you can shift your focus to other things.
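As a rough illustration of the metric named in the title, Mean Lead Time for Changes is commonly computed as the average time between a change being committed and that change running in production. The sketch below assumes that definition; the function name `mean_lead_time` and the sample timestamps are hypothetical and are not taken from the linked article.

```python
from datetime import datetime, timedelta


def mean_lead_time(changes):
    """Return the average commit-to-deploy duration.

    `changes` is a list of (committed_at, deployed_at) datetime pairs.
    This assumes lead time is measured from commit to production deploy.
    """
    if not changes:
        raise ValueError("need at least one change")
    # Sum the individual lead times, starting from a zero timedelta.
    total = sum(
        (deployed - committed for committed, deployed in changes),
        timedelta(),
    )
    return total / len(changes)


# Hypothetical sample data: three changes with 6 h, 24 h, and 6 h lead times.
changes = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 15, 0)),
    (datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 3, 10, 0)),
    (datetime(2023, 5, 4, 8, 0), datetime(2023, 5, 4, 14, 0)),
]

print(mean_lead_time(changes))  # 12:00:00
```

A mean is sensitive to a single slow change, so teams that track this number may also want to look at the median.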

AWS CloudTrail vs CloudWatch: Features & Instructions

In today’s digital world, cloud computing is necessary for businesses of all types and sizes, and Amazon Web Services (AWS) is undoubtedly the most popular cloud computing service provider. AWS provides a vast array of services, including CloudWatch and CloudTrail, that can monitor and log events in AWS resources. This article will compare AWS CloudWatch and CloudTrail, looking at their features, use cases, and technical considerations.

AIOps and Automation: A Conversation Featuring Guest Speaker Carlos Casanova, Forrester Principal Analyst

At the beginning of 2023, I had a great conversation in a webinar with Carlos Casanova, a Forrester Principal Analyst, about how AIOps can help drive successful organizational change. In that conversation, Carlos divided the AIOps market into two camps: technology-centric (primarily APM/Observability players) and process-centric. PagerDuty is a process-centric solution leveraging multiple technologies.

Featured Post

After action reports: post-incident investigations

When something unexpected happens within the digital operations remit, software engineers put on their deerstalker hats and wax their fussy little moustaches, metaphorically speaking. It's their time to play detective as they unravel the evidence and create the reports to explain the recent IT incident. But unlike with a hat-wearing Sherlock Holmes or a hirsute Hercule Poirot, cliff-hanger endings are not encouraged in software engineering.

Understanding Kubernetes Logs and Using Them to Improve Cluster Resilience

In the complex world of Kubernetes, logs serve as the backbone of effective monitoring, debugging, and issue diagnosis. They provide indispensable insights into the behavior and performance of individual components within a Kubernetes cluster, such as containers, nodes, and services.

What Is Root Cause Analysis?

Root Cause Analysis (RCA) is a systematic process designed to uncover the fundamental, underlying issues that lead to IT incidents. These 'root causes' are often masked by surface-level symptoms, making them challenging to identify without a structured approach. Root Cause Analysis serves as a metaphorical excavation, drilling past the initial problems to discover deeper, hidden issues.

Incident Analysis: Understanding Importance and Benefits

Incidents and accidents can occur in various domains, from information technology and cybersecurity breaches to workplace accidents and transportation mishaps. When faced with such incidents, it becomes crucial to conduct a thorough analysis to understand the underlying causes and implications. Incident analysis goes beyond problem-solving; it offers valuable insights into preventing future occurrences and improving systems and processes.

Introducing powerful APIs and webhooks for Grafana Incident

Grafana Incident, Grafana’s powerful incident response tool, comes with a range of integrations out of the box, including Zoom and Google Meet spaces, GitHub and JIRA issues, and even a Google Doc template for post-incident review documents. However, every team has unique needs and workflows, and you may need to integrate with other systems not currently on our roadmap or even use your own in-house tools.