Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Unified Incident Management: Merits of Combined On-Call and Incident Response | Squadcast

In this session, we explore the crucial aspects of effective on-call management and incident response in product organizations. Squadcast combines On-Call and Incident Response into a single platform using automation capabilities for enhanced reliability, continuous learning, and better productivity.

Choosing the Right Career Path in Tech: Software Engineering vs. Site Reliability Engineering (SRE)

The tech industry is booming, and there are many different career paths. But two of the most popular and in-demand roles are Software Engineering (SWE) and Site Reliability Engineering (SRE). SRE blends elements of software engineering with IT operations, focusing on reliability. Software Engineering, on the other hand, involves designing, developing, testing, and deploying software applications.

October 2023 Update - New layout, additional cross links, improved event filtering and much more

Our October update brings a new layout in the web portal, new additional cross-references from Signl details to linked entities, and improved grouping options for conditions in the distribution rules. As always, all the details are in this blog article.

What is Mean Time Between Failures - and why does it matter for service availability

Mean Time Between Failures (MTBF) measures the average duration between repairable failures of a system or product. MTBF helps us anticipate how likely a system, application, or service is to fail within a specific period, or how often a particular type of failure may occur. In short, MTBF is a vital incident metric that indicates product or service availability (i.e. uptime) and reliability.
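As a rough illustration of the metric described above, MTBF is commonly computed as total operating time divided by the number of failures in that period. The sketch below assumes that common definition. The function names are hypothetical, and the availability helper relies on a mean time to repair (MTTR) figure, which the article teaser does not mention and is introduced here only as an assumption.

```python
def mean_time_between_failures(total_operating_hours: float, failure_count: int) -> float:
    """Average operating time between repairable failures, in hours."""
    if failure_count <= 0:
        # With no recorded failures there is nothing to average over.
        raise ValueError("MTBF needs at least one recorded failure")
    return total_operating_hours / failure_count


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimate (assumption: MTBF / (MTBF + MTTR))."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: 1,000 hours of operation with 4 failures.
mtbf = mean_time_between_failures(1000, 4)
print(mtbf)  # 250.0
```

With these made-up inputs, a service that also takes two hours on average to repair would be estimated as available a little over 99% of the time via `availability(mtbf, 2.0)`.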

Enhance Your Customer Service with PagerDuty for ServiceNow CSM

In today’s fast-paced, digital-first landscape, delivering exceptional customer experience is paramount to business success. For customer service teams, that means maintaining service level agreements (SLAs) and ensuring swift responses to customer issues that can make or break your company’s reputation. Fortunately, PagerDuty has improved the way companies handle customer service teams and has built applications into ServiceNow’s CSM platform.

Alerting, Incident Management and the SDLC | Better Incidents Podcast Ep. 8

In this episode we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.

Global Event Rulesets: Streamlining Alert Routing Across Services

In the fast-paced world of organizations handling numerous microservices and projects, tackling the challenges that arise can be a daunting task. Because many of our customers come with infrastructures that include a large number of microservices, we set out to make it easier for them to streamline alert source management. Enter Global Event Rulesets (GER). This feature is designed to redefine the way you manage alerts.

Whose fault was it anyway? On blameless post-mortems

No one wants to be on the receiving end of the blame game—especially in the wake of a major incident. Sure, you know you were the one who made the final change that caused the incident. And hopefully, it was a small one that didn’t cause any SEV-1s. Still, the weight of knowing you caused something bad should be enough, right? Unfortunately, sometimes fingers get pointed, your name gets called, and suddenly, everyone knows that you’re the person who created more work for everyone.

Choosing the Right Metrics for Noiseless K8s Alerting

Watch Ankur Rawal and Dheeraj Reddy talk about how to choose the right metrics for noiseless K8s alerting, with insights and suggestions based on the mistakes made by hundreds of companies while implementing Prometheus Alertmanager in their production systems. Learn how much bad monitoring could be costing you. This talk was delivered at PromCon 2023 in Berlin.