The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Sponsored Post

Scaling Site Reliability Engineering Teams the Right Way

Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when these teams may need to scale. However, it's important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let's unpack what scaling a team is all about: what the indicators are, what steps you can take, and how you know when you're done.

Forgot to declare an incident? Add it retroactively in FireHydrant.

Have you ever quickly worked through an issue with your team and later thought, “Huh. That probably should have been an incident”? It happened to us just a few weeks back. After one of our engineers surfaced a failed build, a few folks chimed in to problem-solve, and within 30 minutes things were up and running like normal. But we probably should have declared an incident.

New Features: Next-Generation Notifications UI, Take-On Call Widget, Alert Templates, Dynamic Policy Routing, Service Groups

This post highlights some of the features and improvements that we have released in the last two months. If you want to submit your own ideas or vote on existing feature requests, you can now use our public roadmap at roadmap.ilert.com.

SIGNL4 Onboarding: Scheduling - Creation & Options

The SIGNL4 Onboarding series walks users through the SIGNL4 process, from signup to alerts to settings. Today's video focuses on scheduling users for duty shifts. Learn how to schedule users for SIGNL4 shifts, what the scheduling options are, and how they affect your team and schedule. Learn how to create a schedule and then copy it so you only have to create it once. This video is packed with tips to help you get the most out of your account.

How to get started with BigPanda Incident Intelligence and Automation powered by AIOps

If you’re in IT operations or manage NOC, SRE, and DevOps teams, chances are your IT environment is growing too complex for you and your teams to manage. Enterprises around the globe, large and small, are continuously changing their IT stacks due to evolving business requirements and significant industry trends. But digital transformation, hybrid infrastructure, DevOps adoption, and continuous integration and continuous delivery (CI/CD) pipelines are all causing major headaches.

The Dangers of Alert Fatigue: Strategies for Effective Alert Management

Alert fatigue is a serious issue that affects numerous professions, especially in the IT industry. It can lead to neglected critical events and delayed response times. IT teams need to continuously monitor their systems and applications to avert possible downtime and keep operations running smoothly. However, a high volume of incoming alerts can inundate these teams and make them less responsive. The consequences of ignored alerts can severely affect the efficiency and dependability of IT teams.

User story: How a global media company reduced costly outages by implementing a secure DevSecOps collaboration platform

Catastrophic failures — such as a security breach or a complete outage leading to an unavailable product or service — are classified as Sev0 incidents. On a severity scale of 1–3, Sev0 is dire. It brings business to a complete standstill and may lead to loss of revenue and a damaged reputation. A Sev0 incident usually has no quick workaround; it requires a coordinated effort beyond the engineering team to diagnose, correct, and manage.