Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Sponsored Post

Infrastructure monitoring using kube-prometheus operator

Prometheus has emerged as the de facto open source standard for monitoring Kubernetes deployments. In this Squadcast tutorial, Kristijan Mitevski shows how to install and configure infrastructure monitoring for a Kubernetes cluster using the kube-prometheus operator, display the collected metrics in Grafana, and configure the Prometheus Alertmanager cluster to route alerts to Slack using webhooks.
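As a rough illustration of the Slack routing step, the sketch below builds a minimal Alertmanager configuration with one route and one Slack receiver, then prints it. It is an assumption-laden example, not the tutorial's own configuration: the receiver name `slack-notifications`, the `#alerts` channel, the grouping intervals, and the webhook URL are hypothetical placeholders. The keys (`route`, `receiver`, `group_by`, `receivers`, `slack_configs`, `api_url`, `channel`, `send_resolved`) follow Alertmanager's configuration format. The output is JSON, which YAML parsers generally accept, so it can serve as a starting point for an `alertmanager.yaml` file.

```python
import json

# Hypothetical placeholder: replace with your own Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"


def build_alertmanager_config(webhook_url: str, channel: str) -> dict:
    """Return a minimal Alertmanager config that sends every alert to one Slack channel."""
    return {
        "route": {
            # Alerts that share these labels are batched into one notification.
            "group_by": ["alertname", "namespace"],
            "group_wait": "30s",
            "group_interval": "5m",
            "repeat_interval": "4h",
            # Default receiver for every alert (no child routes are defined here).
            "receiver": "slack-notifications",
        },
        "receivers": [
            {
                # Hypothetical receiver name; it must match route.receiver above.
                "name": "slack-notifications",
                "slack_configs": [
                    {
                        "api_url": webhook_url,
                        "channel": channel,
                        # Also send a notification when the alert stops firing.
                        "send_resolved": True,
                    }
                ],
            }
        ],
    }


config = build_alertmanager_config(SLACK_WEBHOOK_URL, "#alerts")
# Print the config as JSON, which can be saved as alertmanager.yaml.
print(json.dumps(config, indent=2))
```

Keeping a single catch-all route keeps the example small; a real setup would usually add child routes that match on labels such as severity.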

Build custom API integrations with incident.io

We’re building incident.io as the single place you turn to when things go wrong. When an issue is disrupting your business-as-usual, the last thing you want is to start opening ten different tools to diagnose and fix it! As your central incident hub, we need to give you two powers: Workflows cover the former. Workflows are like a mini incident.io Zapier.

How to Pick the Best Incident Response Software

With the rising complexity of our digital ecosystems, incidents are occurring at an unprecedented rate. To combat the additional strain, incident responders are looking to software to help them establish a scalable, repeatable incident response process that reduces toil and noise and gets the right people on the scene at the right time. The best incident response software addresses the entire lifecycle of an incident.

AIOps' certainty in an uncertain future

BigPanda’s recent coronation as a unicorn has prompted its leaders to look to the future of IT Operations and how it relates to artificial intelligence (AI) and machine learning (ML). What is BigPanda’s role in improving IT Ops? How can AIOps help global enterprises achieve more? These are the questions that Mohan Kompella, BigPanda’s VP of Product Marketing, who has spent more than 15 years in IT Operations, has been asking.

Sponsored Post

Your Goals Could Be Holding Your DevOps Teams Back

In the era of Agile, organizations are increasingly moving their IT service management teams toward a DevOps world. Transforming ITSM to DevOps brings many challenges, and one of the most significant is goal setting. In today's fast-paced business environment, it's important to establish the parameters for measuring success and to determine which objectives teams need to meet to accomplish business goals.

SRE: From Theory to Practice | What's difficult about on-call?

We have launched the first episode of a webinar series tackling one of the major challenges facing organizations: on-call. In SRE: From Theory to Practice | What's difficult about on-call, Blameless engineers Kurt Andersen and Matt Davis are joined by Yvonne Lam, staff software engineer at Kong, and Charles Cary, CEO of Shoreline, for a fireside chat about everything on-call. As software becomes more ubiquitous and necessary in our lives, our standards for reliability grow alongside it.

How Well Does Your Infrastructure Support Major Incident Management?

Effective major incident management depends on many things, including planning, precise execution, effective communication, and applying learnings from previous incidents to update those plans. Traditional major incident management wisdom stresses the importance of the remediation process, but it says little about how your IT infrastructure should be configured to support it.

SRE Adoption | A 2-Year Retrospective (From A Business Point-Of-View)

This month marks my two-year anniversary at Blameless. As our industry progresses and matures, I thought it would be a good opportunity to look back at how far we have come and to ruminate on where we’re headed. Our shared vision at Blameless is to help engineering teams adopt reliability practices with ease and advance to a resilient culture.