Latest News

How to Reduce MTTR: A Complete Guide

Organizations striving to improve their operational efficiency must know how to reduce MTTR, as it plays a key role in today’s fiercely competitive business landscape. Customer satisfaction is a top priority for most businesses, and a late response to customer queries or issues can damage it. To track response and resolution time, businesses measure their MTTR. MTTR is a key metric that shows how much time an organization takes to resolve an incident or issue.

How observability and AIOps work better together

If you’re juggling complex, cloud-based, containerized systems and aiming to meet high customer expectations, your old monitoring processes probably don’t cut it anymore. Increasing infrastructure complexity means you need to instrument more, log more, and monitor more. That leads to even more complexity. The answer is better observability, right? Yes and no. Observability and monitoring are critical, but they are only part of what you need for service awareness and availability.

Captain’s Log: A first look at our architecture for Signals

Welcome to the first Signals Captain’s Log! My name is Robert, and I’m a recovering on-call engineer and the CEO of FireHydrant. When we started our journey of building Signals, a viable replacement for PagerDuty, OpsGenie, etc., we decided very early that we would tell everyone what makes Signals unique, and what better way than to tell you how we’re building it (without revealing too much 😉). Let’s jump in.

The New SEC Rules and You

The Securities and Exchange Commission published new rules for SEC registrants around disclosing incident details and response policies. Compliance with these new rules should be top of mind for any company. Even if your org hasn’t hit the milestone of registering with the SEC, you should be prepared to be compliant when you take that step.

What you need to know about the Digital Operational Resilience Act (DORA)

The European Commission has introduced the Digital Operational Resilience Act (DORA) to bolster the digital infrastructure of the financial sector within the European Union (EU). As part of the EU's wider digital finance strategy, DORA's objective is to create a comprehensive framework governing digital operational resilience. Financial institutions must ensure full compliance with DORA by January 2025.

Mastering Root Cause Analysis: A Guide for Site Reliability Engineers

Site Reliability Engineers (SREs) play a vital role in ensuring the stability and performance of web services and are key in incident management. One of the core skills SREs need is the ability to conduct effective Root Cause Analysis (RCA) when issues arise. This guide is about how to improve your RCA skills for more effective post-incident analysis. Let's dive in.

How we manage incidents at Datadog

Incidents put systems and organizations to the test. They pose particular challenges at scale: in complex distributed environments overseen by many different teams, managing incidents requires extensive structure and planning. But incidents, by definition, break structures and foil plans. As a result, they demand carefully orchestrated yet highly flexible forms of response. This post will provide a look into how we manage incidents at Datadog. We’ll cover our entire process.

The Journey Into Automation: Optimizing Care Delivery

In a world where efficiency and precision are the cornerstones of progress, automation has become the unsung hero across diverse industries. From manufacturing floors to customer service, its transformative power has reshaped the way we work and deliver services. Today, we embark on a journey to explore the profound influence of automation on healthcare, where each automated process is a progressive step towards optimizing care delivery and reshaping the future of patient-centered care delivery.

Suppressing Alert Noise during Scheduled Maintenance

Alert noise is a common problem for IT teams that monitor and manage complex systems. Excessive unactionable alerts triggered by various sources, such as applications, servers, and network devices, can cause alert fatigue. A high volume of alerts can be overwhelming, reducing the team's ability to respond to critical alerts. One common source of alert noise is scheduled maintenance, which is a routine practice in the digital realm.

6 Best Practices for Tuning Network Monitoring Alerts

Network monitoring and alerting provide the foundation for efficient IT operations and cyber resilience. By keeping track of the status and performance of network infrastructure and applications, network monitoring tools can automatically generate alerts when defined thresholds are exceeded or specific events occur. These network monitoring alerts allow IT teams to detect outages, performance degradation, and potential security incidents so they can respond swiftly to minimize disruption.