Incident Management

The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Enhance observability with AI-powered IT operations

Your organization probably relies on a collection of observability tools to track specific elements of its IT stack. You’re not alone; a recent survey from Enterprise Strategy Group showed that most organizations have six or more observability solutions. Our research found that the average BigPanda customer uses 20 observability and monitoring data sources!

Ask the Expert: Insights from Paula Thrasher, Senior Director of Infrastructure and Platform, PagerDuty

In this blog post, Paula Thrasher, Senior Director of Infrastructure and Platform at PagerDuty, shares her take on the challenges and opportunities facing tech leaders today. From managing complexity to driving operational resilience, Thrasher offers expert insights on how executives can get ahead of disruptions.

The Ultimate Guide for Enterprise DevOps

Speed and reliability in incident management have long been central to many businesses’ success. But what happens when this already demanding workflow needs to run at scale? The answer is to adopt enterprise DevOps methodologies that scale operations efficiently. The benefits of DevOps are magnified when they are scaled correctly across an entire enterprise. In this comprehensive guide, we’ll explore the fundamental principles, challenges, and components of enterprise DevOps.

How we handle sensitive data in BigQuery

As a provider of incident management software, we at incident.io manage sensitive data regarding our customers. This includes Personally Identifiable Information (PII) about their employees, such as emails, first names, and last names, as well as confidential details regarding customer incidents, such as names and summaries. Consequently, we approach the management of this data with a great deal of care.

New BigPanda features accelerate IT incident response

ITOps teams are inundated with a significant volume of alerts each day. Sifting through these alerts to discern which ones are harmless and which could lead to major incidents is a time-consuming and tedious task. This process often involves hunting for information across disparate data sources, tools, and workflows. As a result, the investigation can slow down incident response times, negatively affecting service reliability and customer satisfaction.

3 Ways to Streamline Kubernetes Operations with PagerDuty Automation

Kubernetes’ popularity continues to grow, with over 60% of organizations maintaining multiple Kubernetes clusters across diverse environments and teams in some capacity. However, as clusters multiply, so do operational challenges, from monitoring hundreds of microservices to responding to and escalating incidents across distributed systems.

Building an AI Chatbot Playground with React and Vite

Read how we set up an experimental chatbot environment that allows us to switch LLMs dynamically and enhances the predictability of AI-assisted features' behavior within the ilert platform. The article includes a guide on how you can build something similar if you plan to add AI features with a chatbot interface to your product.

A Beginner's Guide To Service Discovery in Prometheus

Service discovery (SD) is a mechanism by which the Prometheus monitoring tool can discover monitorable targets automatically. Instead of every scrape target being listed in the Prometheus configuration, service discovery acts as a source of targets that Prometheus can query at runtime. Service discovery becomes crucial when hosts change dynamically, especially in microservices architectures and environments like Kubernetes.
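
As a rough illustration of the idea, the sketch below uses Prometheus's file-based service discovery, where Prometheus reads target groups from JSON files named under `file_sd_configs` in a scrape job. The instance inventory, the helper names `build_target_groups` and `write_targets_file`, and the output file name are all hypothetical; a real setup would pull instances from whatever registry or inventory it already has.

```python
import json
import os
import tempfile


def build_target_groups(instances):
    """Group instances that share the same labels into file_sd target groups.

    Each instance is a dict with "host", "port", and "labels" keys
    (an assumed shape, chosen for this example only).
    """
    groups = {}
    for inst in instances:
        # Instances with identical labels end up in the same target group.
        key = tuple(sorted(inst["labels"].items()))
        groups.setdefault(key, []).append(f'{inst["host"]}:{inst["port"]}')
    return [
        {"targets": sorted(targets), "labels": dict(key)}
        for key, targets in sorted(groups.items())
    ]


def write_targets_file(path, target_groups):
    """Write the file via a temporary file and rename, so a reader
    never sees a half-written JSON document."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(target_groups, handle, indent=2)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # Hypothetical inventory; addresses and labels are made up.
    inventory = [
        {"host": "10.0.0.1", "port": 9100, "labels": {"job": "node", "env": "prod"}},
        {"host": "10.0.0.2", "port": 9100, "labels": {"job": "node", "env": "prod"}},
        {"host": "10.0.0.3", "port": 8080, "labels": {"job": "api", "env": "prod"}},
    ]
    print(json.dumps(build_target_groups(inventory), indent=2))
```

Rerunning a script like this whenever the inventory changes keeps the targets file current, and Prometheus picks up the new contents without a change to its main configuration. Other discovery mechanisms, such as the Kubernetes one, follow the same principle but query an API instead of reading a file.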

Top 5 outages detected by StatusGator in October 2024

StatusGator’s Early Warning Signals alerted customers to several notable service outages in October 2024. With advance warning, our users could take proactive measures, minimizing the impact of downtime on their businesses. Here’s a summary of how our detection gave customers an edge over service disruptions, often notifying them minutes or hours before the provider even acknowledged the issue.