
Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Incident Resolution for Remote Teams

People working in IT support and incident management right now are faced with unusual difficulties supporting large remote workforces and managing unpredictable workloads. On Reddit, system admins and other IT pros are bemoaning the hiccups and hassles of working in isolation while trying to resolve issues and maintain high SLAs. You can’t go grab your indispensable SME for troubleshooting, because that person is also home and inundated with messages and alerts from many different tools.

Deserted Island DevOps Recap

On April 30, 2020, Austin Parker, Principal Developer Advocate at Lightstep and co-host of On-Call Me Maybe, hosted a one-of-a-kind DevOps conference. With events being cancelled all over the world in the face of COVID-19, virtual conferences have been blooming (see our coverage of Failover Conf here), but Deserted Island DevOps was the first conference ever held in the world of Animal Crossing: New Horizons.

Monitoring service health and downtime events within your Google Cloud with Zenduty

Google Cloud Platform (GCP) is a collection of Google’s computing resources, made available to the general public as a public cloud offering. GCP resources consist of physical hardware infrastructure (computers, hard disk drives, solid-state drives, and networking) housed in Google’s globally distributed data centers, where many of the components are custom designed using patterns similar to those available in the Open Compute Project.

Configure an Intuitive Service Dashboard & Reduce Response Time

Leverage multiple alert sources in Squadcast to reflect your actual system infrastructure on your Service Dashboard. Having your incident management tool reflect your system architecture is a big milestone in reducing cognitive load on your on-call team. To help our users move one step closer to this milestone, we recently released the ability to add multiple alert sources to a service. You can now model your service dashboard to mirror your actual system architecture.

Take huge leaps with Honeycomb for Incident Response

As engineering teams shift from delivering services on monolithic architectures to microservices and even serverless environments, developers are no longer responsible only for creating and maintaining their code. Shared ownership has become the new normal, or is at least trending that way, so developers now respond to production incidents and in some cases join the on-call rotation. Incidents vary in impact, but they do take time away from innovation and building new capabilities.

How resilience and security shift left: An interview with the EVP Product & Engineering and CISO at FOX

Melody Hildebrandt is the Executive Vice President of Product & Engineering and CISO at FOX. Her career began with designing wargames for the Department of Defense. She has gained tremendous experience in disaster planning, testing, security, and resilience at organizations such as Palantir. Recently, she led the effort to plan and execute FOX’s digital streaming of Super Bowl 54, including taking over an entire sound stage in the process.