
Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

xMatters Overview - xMatters Demo

Join Stephen Walters, Solutions Architect and DevOps Institute Ambassador, and Daniel Topham, Solutions Architect, as they guide you through a high-level demo of the xMatters solution. See how xMatters sends alerts to the right users at the right time and enriches notifications with relevant data. Also learn how easy it can be to use Flow Designer to integrate different tools and software and create innovative workflows with drag-and-drop capability.

Workflow Form Layout - xMatters Support

In xMatters, the form layout is where you customize the content and options that are available to the message sender. You can use the form layout to do things like predefine recipients for your messages, add a conference bridge, attach documents, specify a customized sender display name, or add a map that the sender can use to target users at specific sites.

Amplify Artifactory and Distribution Changes Through PagerDuty

When automated software delivery runs smoothly, it can whisper, and quietly attend to itself. But when your delivery and distribution pipeline runs into a problem, it must shout. Boosting the volume of Artifactory and Distribution change events and issues through PagerDuty can help ensure they’re heard by everyone whose job it is to monitor your software delivery pipeline.

Kubernetes Health Check Using Probes

Kubernetes is an open source container orchestration platform that significantly simplifies an application's creation and management. Distributed systems like Kubernetes can be hard to manage, as they involve many moving parts and all of them must work for the system to function. Even if a small part breaks, it needs to be detected, routed and fixed. These actions also need to be automated. Kubernetes allows us to do that with the help of readiness and liveness probes.
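As a rough illustration of the difference between the two probe types, here is a minimal Python sketch of the checks an application might expose for them. The `AppState` fields and the function names are hypothetical and not part of any Kubernetes API. The only Kubernetes behavior assumed is that an HTTP probe counts any status code from 200 up to (but not including) 400 as a success.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical application state used to answer probe requests."""
    started: bool = False          # initialisation has finished
    dependencies_ok: bool = False  # e.g. a database is reachable (assumed)
    deadlocked: bool = False       # the process can no longer make progress


def liveness_status(state: AppState) -> int:
    """Liveness: is the process healthy enough to keep running?

    A failing liveness probe tells Kubernetes to restart the container.
    """
    return 500 if state.deadlocked else 200


def readiness_status(state: AppState) -> int:
    """Readiness: can the process serve traffic right now?

    A failing readiness probe removes the pod from Service endpoints
    without restarting it.
    """
    ready = state.started and state.dependencies_ok and not state.deadlocked
    return 200 if ready else 503


def probe_succeeded(status_code: int) -> bool:
    """HTTP probes treat 200 <= code < 400 as success."""
    return 200 <= status_code < 400
```

A freshly started process in this sketch is alive but not yet ready, so it would be kept running while receiving no traffic until `started` and `dependencies_ok` both become true.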

Mastering Digital Operations Across the Enterprise

I’m excited to announce that today, PagerDuty is taking our automation capabilities to a new scale and scope as we enter into a definitive agreement to acquire Catalytic. With their technology and talented team, we accelerate the delivery of enterprise-wide process automation that manages no-code workflows across the business, broadly applicable to any workflow, for any employee.

Postmortems Now Called Retrospectives in Blameless

Something big happened at Blameless this month: our “Postmortem” feature was updated to its new name, “Retrospective”. To the naysayer, I suppose you’re thinking, “This seems trivial. Different teams call it different names anyway, so why bother making the change?” First let me say, thank you for reading our blog, and I hope you finish this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Customizing Error Pages (Nginx Ingress Controller)

The most common approach, which is part of the official solution, is to create a Docker image of a server that responds to any request with 404 content, except /healthz and /metrics. This could be an Nginx instance. /healthz should return 200. /metrics is optional, but it should return data that Prometheus can read in case you are using it for k8s metrics. Note: Nginx can provide some basic data that Prometheus can read. / returns a 404 with your custom HTML content.
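The routing rules described above can be sketched in a few lines of Python using only the standard library. This is an illustrative stand-in rather than the official default backend image: the port, the HTML body, and the `default_backend_up` metric name are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Placeholder page; replace with your own custom HTML.
CUSTOM_404_HTML = b"<html><body><h1>404</h1><p>Page not found.</p></body></html>"

# Hypothetical metric in the Prometheus text exposition format.
METRICS_BODY = b"# TYPE default_backend_up gauge\ndefault_backend_up 1\n"


def route(path: str) -> Tuple[int, str, bytes]:
    """Map a request path to (status code, content type, body)."""
    if path == "/healthz":
        return 200, "text/plain", b"ok"
    if path == "/metrics":
        return 200, "text/plain; version=0.0.4", METRICS_BODY
    # Every other path, including "/", gets the custom 404 page.
    return 404, "text/html", CUSTOM_404_HTML


class DefaultBackendHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Ignore any query string when matching the path.
        status, content_type, body = route(self.path.split("?", 1)[0])
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the backend until interrupted (the port is an assumption)."""
    HTTPServer(("0.0.0.0", port), DefaultBackendHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the routing in a plain `route` function makes the behavior easy to check without opening a socket.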

Alert Fatigue in SRE: What It Is & How To Avoid It

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume of alerts and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.

The BigPanda ScaleUp Journey: Human/AI Collaboration, Predictive Accuracy, and Scale Power in AIOps

At the beginning of the COVID-19 pandemic, we anticipated a slow-down in IT-related spending. In reality, the opposite occurred. Companies massively expanded their digital offerings using the same IT staff they’d had pre-pandemic, even as the teams lost access to many of their existing tools while working from home. This acceleration put immense pressure on IT teams everywhere, resulting in messy incident management, outages, and a huge shortage of talent.