Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Industry Experts Explain how to Thrive in a Post-COVID World

Sep 3, 2020 By Blameless Community In Blameless

With complex architectures, gaining visibility into systems is becoming more difficult. Additionally, with the move to remote work, it’s more important than ever before to adapt to new modes of work such as asynchronous collaboration. So how do we adjust to these changing times? In a CIO panel hosted by Lightspeed Venture Partners, industry experts came together to discuss these questions. Below are key insights from their conversation.

Retail Industry Trends 2020: All-In on Digital Since COVID-19

Sep 3, 2020 By Vivian Chan In PagerDuty

This is the first in a series of posts we’ll be publishing on trends we’re seeing in the retail industry and how IT organizations tasked with deploying and maintaining flawless digital customer experiences can take advantage of PagerDuty to ensure always-on reliability. It’s been a tough year for retail.

Fiserv Eliminates Ticket Overload with AIOps

Sep 3, 2020 By Juan Perez In Moogsoft

Fiserv, the Fortune 500 payments and financial technology provider, needed to streamline and automate its IT incident management process to detect and fix issues earlier and more quickly. The incident management workflow was complex, primarily because mergers and acquisitions over the years had made Fiserv’s IT environment very heterogeneous. “The challenges we were facing were enormous,” IT Director Chris Kreps says.

DevOpsDays Chicago 2020 Wrapup

Sep 3, 2020 By Rich Burroughs In FireHydrant

DevOpsDays Chicago 2020 was held online on September 1. It was the first time the conference was held virtually due to the coronavirus pandemic. I was excited to attend for a couple of reasons. First, DevOpsDays Chicago is one of the better-known and more respected DevOpsDays events held in the US. I’d never been able to attend it before, so it was great to get the opportunity. Also, I’d been missing the DevOpsDays community.

Determining Error Budgets and Policies that Work for Your Team

Sep 2, 2020 By Hannah Culver In Blameless

SLOs are key pillars in organizations’ reliability journeys. But once you’ve set your SLOs, you need to know what to do with them. If they’re only metrics that you’re paged for once in a blue moon, they’ll become obsolete. To make sure your SLOs stay relevant, determine error budgets and policies for your teams. In this blog post, we’ll look at the basics of error budgeting, how to set corresponding policies, and how to operationalize SLOs for the long term.
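
As a rough illustration of the error-budget arithmetic mentioned above, here is a minimal Python sketch. It assumes a simple time-based availability SLO over a rolling window; the function names and the policy thresholds (0% and 25% of budget remaining) are hypothetical and not taken from the Blameless post.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is the share of the window the SLO allows to be "bad".
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def policy_action(remaining_fraction: float) -> str:
    """Map remaining budget to an action. Thresholds are illustrative only."""
    if remaining_fraction <= 0.0:
        return "freeze feature releases"
    if remaining_fraction < 0.25:
        return "prioritize reliability work"
    return "ship as normal"


if __name__ == "__main__":
    remaining = budget_remaining(0.999, 30, downtime_minutes=21.6)
    print(f"{remaining:.0%} of budget left -> {policy_action(remaining)}")
```

A 99.9% target over 30 days leaves roughly 43 minutes of allowable downtime; a team would tune both the window and the thresholds to its own policy.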

AIOps - Done the Self-Service Way

Sep 2, 2020 By Shai Israel In BigPanda

Last week I went camping with some friends. One of them did the shopping for all of us, so I sent him my share using a payment app. It took me less than 2 minutes to complete the transaction. A few years ago, a similar transaction would have me going to the bank to complete the task, or at a minimum, calling a bank teller and having him do it. Try to imagine a bank asking its customers to do any of these things today. It would probably lose all its customers in no time.

How to Build Your SRE Team

Sep 1, 2020 By Emily Arnott In Blameless

As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs, to management upholding the virtue of blamelessness, to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

Archive Incident Slack Channel

Sep 1, 2020 By FireHydrant In FireHydrant

View Video

Datadog and Relay for Incident Response

Sep 1, 2020 By Eric Sorenson In Puppet

Datadog is an awesome tool for aggregating and visualizing the metrics that matter to you. Recently, Datadog launched a new Incident Management feature, which allows you to coordinate the activities around a problem that affected your service. In this example, I’ll walk through using Relay to roll back a Kubernetes deployment that caused a service impact, and show how the Datadog Incident timeline can keep everyone working on the incident in sync.

How SIGNL4 solves typical problems in network monitoring

Sep 1, 2020 By Matt In SIGNL4

A new article in the September issue of German magazine LANLine (“Automation creates productivity”) summarizes typical challenges and problems in network monitoring very well and is worth reading. I would like to briefly discuss some of the problems addressed and how our product SIGNL4 was developed as a solution for exactly these problems.
