Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

When More Incident Commanders are Better

It has been lightly revised and reposted with his permission from the original article on Medium. Leading major incident responses can be extremely stressful. You have to quickly gather an ad-hoc team, figure out what went wrong, identify a fix and make sure this doesn't make things worse, all the while with senior leadership breathing down your neck. Are we having fun yet? Many people think having a dedicated incident commander role will solve the problem.

Captain's Log: Diving into our scheduling design

On-call scheduling is tricky. Like, really tricky. It was one of the scariest parts when we decided to build a modern alerting system earlier this year. We knew we couldn't cut any corners on Day One of our release because it needed to be a fully loaded feature for someone to realistically use our product (and replace an incumbent). This meant including windowed restrictions, coverage requests, and simple to complex rotations.

On-Call Management Models

In today's fast-paced digital landscape, incident management is crucial for maintaining operational excellence. During this process, on-call management models play a critical role in promptly addressing and resolving incidents. On-call management involves the organization of teams to ensure prompt response and resolution of incidents and is necessary to streamline incident resolution, ensure 24/7 availability, and allow for fair and transparent on-call rotations.

Ping Command: A Comprehensive Guide to Network Connectivity Tests

The ping network test, a core utility since the 80s, plays a crucial role in confirming connectivity between IP-networked devices. In this guide, we'll delve into what the ping command is, how to run a ping network test, common IP addresses to ping, interpreting results, and troubleshooting errors.

Events vs. Alerts vs. Incidents

Event. Alert. Incident. These terms are bandied about, often interchangeably, in IT operations management. Broadly speaking, they all refer to situations where something is potentially amiss and needs to be investigated and resolved. Each of these three words does, however, have a distinct definition. Because they are used in scenarios where clear communication and timeliness are critical, it’s important to understand the differences and use them appropriately.

Learn the Incident Response Life Cycle - Best Practices and Strategies

No company plans for a security breach, major outage, or other cyber incident, but they happen. When an incident occurs, having a standardized, regulated method of managing the fallout is critical. This is where the incident response life cycle comes in ‍

How to Route Alerts to Subject Matter Experts Using Squadcast Tagging & Routing Rules?

Effective Incident Management is crucial for ensuring customer satisfaction and brand loyalty. As systems grow more complex, efficiently directing alerts to the right teams becomes crucial. This article delves into the challenges, implementation, and benefits of automating incident categorization.

How to improve your IT alert management: Understanding best practices

As an IT leader, you’re under significant pressure to control the constant alerts. Somehow, you must manage non-stop IT alerts while also ensuring ultra-high service availability. The task is far from easy, and even the most sophisticated teams struggle to keep up and turn alerts into action with tech stacks that are constantly growing in size and complexity. IT alert management is the first line of defense.

Your guide to better incident status pages

Your status page (or lack thereof) has the opportunity to signal a lot about your brand — how transparent you are, how quickly you respond to incidents, how you communicate with your customers — and ultimately, this all seriously impacts your reliability. After all, as our CEO Robert put it in a recent interview on the SRE Path podcast, you don’t get to decide your reliability; your customers do.