The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Working with multiple on-call teams using Zabbix and iLert

This post outlines how to use Zabbix and iLert with multiple on-call teams, where each team is responsible for a set of host groups in Zabbix and therefore receives alerts only for the services it owns. But first, let’s start with the basic needs of being on call.
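
The routing idea described above can be sketched in a few lines of Python. This is only an illustration of mapping host groups to teams, not the actual Zabbix or iLert configuration: the group names, team names, the `teams_for_alert` helper, and the fallback team are all hypothetical.

```python
# Hypothetical mapping from Zabbix host group to on-call team.
# None of these names come from Zabbix or iLert; they are placeholders.
TEAM_BY_HOST_GROUP = {
    "Databases": "dba-oncall",
    "Web servers": "web-oncall",
}


def teams_for_alert(host_groups, routing=TEAM_BY_HOST_GROUP, default="ops-oncall"):
    """Return the sorted list of teams responsible for an alert's host groups.

    If no host group is mapped, fall back to a default team (an assumption
    made here so that no alert goes unrouted).
    """
    teams = {routing[group] for group in host_groups if group in routing}
    return sorted(teams) if teams else [default]
```

With this sketch, an alert from a host in the hypothetical "Databases" group reaches only `dba-oncall`, while an alert from an unmapped group goes to the fallback team.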

Retail Industry Trends 2020: All-In on Digital Since COVID-19

This is the first in a series of posts we’ll be publishing on trends we’re seeing in the retail industry and how IT organizations tasked with deploying and maintaining flawless digital customer experiences can take advantage of PagerDuty to ensure always-on reliability. It’s been a tough year for retail.

Fiserv Eliminates Ticket Overload with AIOps

Fiserv, the Fortune 500 payments and financial technology provider, needed to streamline and automate its IT incident management process to detect and fix issues earlier and more quickly. The incident management workflow was complex, primarily because mergers and acquisitions over the years had made Fiserv’s IT environment very heterogeneous. “The challenges we were facing were enormous,” IT Director Chris Kreps says.

DevOpsDays Chicago 2020 Wrapup

DevOpsDays Chicago 2020 was held online on September 1. Because of the coronavirus pandemic, it was the first time the conference was held virtually. I was excited to attend for a couple of reasons. First, DevOpsDays Chicago is one of the better-known and respected DevOpsDays events held in the US. I’d never been able to attend it before, so it was great to get the opportunity. Also, I’d been missing the DevOpsDays community.

AIOps - Done the Self-Service Way

Last week I went camping with some friends. One of them did the shopping for all of us, so I sent him my share using a payment app. It took me less than two minutes to complete the transaction. A few years ago, a similar transaction would have had me going to the bank or, at a minimum, calling a bank teller and having him do it. Try to imagine a bank asking its customers to do any of these things today. It would probably lose all its customers in no time.

Datadog and Relay for Incident Response

Datadog is an awesome tool for aggregating and visualizing the metrics that matter to you. Recently, Datadog launched a new Incident Management feature, which allows you to coordinate the activities around a problem that affected your service. In this example, I’ll walk through using Relay to roll back a Kubernetes deployment that caused a service impact, and show how the Datadog Incident timeline can keep everyone working on the incident in sync.

How SIGNL4 solves typical problems in network monitoring

A new article in the September issue of the German magazine LANLine (“Automation creates productivity”) summarizes typical challenges and problems in network monitoring very well and is worth reading. I would like to briefly discuss some of the problems it addresses and how our product SIGNL4 was developed as a solution for exactly these problems.

Incident Management Process: 5 Steps to Effective Resolution

An incident management process is a set of procedures and actions taken to respond to and resolve critical incidents: how incidents are detected and communicated, who is responsible, what tools are used, and what steps are taken to resolve the incident. Incident management processes are used across many industries, and incidents can include anything from IT system failure, to events requiring the attention of healthcare professionals, to critical maintenance of physical infrastructure.

Customize your Enterprise Alert dashboard

There is nothing more frustrating for IT professionals than having to go to multiple places, and sometimes into multiple systems, to track down an issue. Yes, it is the job, but Enterprise Alert provides a single pane of glass that contains all events, policies, and alert notifications in one place. The next question we asked was, “Is all of the relevant data easily accessible, and can it be viewed from one central screen?”