
Breaking siloes: How to use cross-store correlations with Grafana

Grafana is great at hopping between signals in its native backends (Grafana Loki, Grafana Mimir, Grafana Tempo). But your data doesn’t have to live there to get the same smooth workflow. After all, we don’t just pay lip service to our “big tent” philosophy: we want to meet all our users’ diverse needs, regardless of what kind of data they have or where they store it.

Coordinate large-scale engineering initiatives with IDP Campaigns

As organizations grow, engineering leaders often need to drive cross-team initiatives such as reducing cloud spend, upgrading runtimes, or strengthening security controls. Tracking this work can quickly become fragmented across spreadsheets, dashboards, and status meetings. Progress is hard to measure, accountability is unclear, and the impact of each effort can be difficult to demonstrate.

The Rise of Experience-Level Agreements (XLAs) in Practice: A Deep Dive into ITSM Transformation

For decades, the backbone of IT Service Management (ITSM) has been the Service-Level Agreement (SLA). While effective for tracking the nuts and bolts of IT delivery, SLAs have one critical blind spot: they say little about how users actually feel about their IT experiences. This is where Experience-Level Agreements (XLAs) and Digital Employee Experience (DEX) fill in the rest of the picture.

The Hidden Cost of Untagged Cloud Resources for SMBs

Cloud computing is a powerful enabler of growth and agility for small and medium businesses (SMBs). However, untagged cloud resources are one of the primary challenges most SMBs face in cloud environments. Without tags, teams lose visibility into and accountability for cloud spending, which in turn results in wasted budgets and cost overruns.
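To make the visibility problem concrete, here is a minimal sketch of a tag audit over a resource inventory. Everything in it is an assumption for illustration: the `REQUIRED_TAGS` policy, the inventory record shape (`id`, `monthly_cost`, `tags`), and the sample values are hypothetical and do not come from any particular cloud provider's API.

```python
# Hypothetical tagging policy: every resource should carry these tag keys.
REQUIRED_TAGS = ("owner", "cost-center")


def find_untagged(resources, required=REQUIRED_TAGS):
    """Return (resource id, missing tag keys) for each resource that breaks the policy."""
    report = []
    for res in resources:
        tags = res.get("tags") or {}
        # A tag counts as missing when the key is absent or its value is empty.
        missing = [key for key in required if not tags.get(key)]
        if missing:
            report.append((res["id"], missing))
    return report


def untagged_spend(resources, required=REQUIRED_TAGS):
    """Sum the monthly cost of every resource that is missing a required tag."""
    flagged = {rid for rid, _ in find_untagged(resources, required)}
    return sum(r["monthly_cost"] for r in resources if r["id"] in flagged)


# Made-up inventory, e.g. exported from a billing report.
inventory = [
    {"id": "vm-1", "monthly_cost": 120.0, "tags": {"owner": "web", "cost-center": "cc-10"}},
    {"id": "vm-2", "monthly_cost": 80.0, "tags": {"owner": "data"}},
    {"id": "disk-7", "monthly_cost": 15.5, "tags": {}},
]

for resource_id, missing in find_untagged(inventory):
    print(f"{resource_id}: missing {', '.join(missing)}")
print(f"Unattributed monthly spend: {untagged_spend(inventory):.2f}")
```

In this sketch, the unattributed spend figure is the part of the bill that nobody can be held accountable for, which is the cost the teaser above describes.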

Icinga Notifications v0.2.0 Release

Some of you might have heard about this at OSMC, or you may have already received a release notification from GitHub: our Icinga Notifications project has taken a step forward, and we are happy to announce that version 0.2.0 is now available for you to try out. It addresses feedback that we received on the previous versions, with the most important changes highlighted below.

From Reactive Response to Systemic Resilience: The System That Gets Smarter With Every Incident

Most operations teams are stuck in a reactive loop: resolving incidents as they happen, then moving on to fight the next fire. This approach keeps things running in the short term, but prevents responders from documenting their learnings in a way that improves overall system resilience. There are practical reasons for this.

Top 7 Observability Platforms That Auto-Discover Services

You can use an observability platform that automatically discovers your services and provides ready-to-use dashboards with minimal setup. If you're running a system where microservices come and go, containers shift around, or serverless functions scale up quickly, this kind of experience saves you a lot of time. You gain visibility as soon as something goes live, without requiring any additional steps on your part. In this blog, we talk about the top seven platforms that offer these capabilities.

What to Expect When You Migrate to Atatus APM

As organizations aim for exceptional software reliability and user satisfaction, migrating to Atatus APM is a key upgrade in application monitoring. With nearly 80% of companies facing costly downtime exceeding $300,000 per hour, robust APM solutions like Atatus are crucial. It helps teams quickly identify bottlenecks, optimize performance, and improve the customer experience through comprehensive, real-time insights.

Incident Management vs Change Management: Key Differences Explained

Incident management and change management address two different moments that teams face every day. One is a reaction to failure. The other is a planned improvement. That’s the heart of incident management vs. change management. Both keep systems reliable, and both help teams move faster without breaking things. Let’s explore how they differ and how they work together.

4 Golden Signals of System Reliability: A Practical Guide for Your Team

Modern systems produce endless streams of metrics: CPU usage, request volume, cache hit rates, node counts, queue depth, and the list keeps growing. With this much data, it’s easy for teams to get lost in dashboards without knowing what actually matters. That’s why DevOps and SRE teams rely on the 4 Golden Signals of System Reliability. They provide a simple, clear way to understand user experience and system health.
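The four golden signals are usually listed as latency, traffic, errors, and saturation. As a rough sketch of how they might be derived from raw request records, the example below computes one value per signal over a time window. The `Request` record, the `golden_signals` function, the choice of a nearest-rank 95th percentile for latency, and the treatment of HTTP 5xx responses as errors are all assumptions made for illustration, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize one observation window as the four golden signals."""
    count = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    if count:
        # Nearest-rank p95: rank = ceil(0.95 * count), done in integer arithmetic.
        rank = (95 * count + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,                           # latency
        "traffic_rps": count / window_seconds,           # traffic
        "error_rate": errors / count if count else 0.0,  # errors
        "saturation": used_capacity / total_capacity,    # saturation
    }


# Made-up sample: 20 requests in a 10-second window, two of them failing,
# on a service using 6 of its 8 worker slots.
sample = [Request(duration_ms=10.0 * i, status=500 if i > 18 else 200) for i in range(1, 21)]
signals = golden_signals(sample, window_seconds=10, used_capacity=6, total_capacity=8)
print(signals)
```

What counts as "capacity" for saturation depends on the service (worker slots, memory, queue depth), so the two capacity arguments here are placeholders for whichever resource is the real constraint.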