Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

How to Use Prometheus for APM

Monitoring applications shouldn’t be a guessing game. But too often, DevOps engineers end up buried under a pile of metrics that don’t help when things go wrong. That’s where Prometheus APM comes in. It offers a straightforward way to make sense of your systems—especially when you're working with modern, distributed setups like microservices.

Squadcast Strengthens Its Leadership in IT Alerting and Incident Management in the G2 Spring Report

2025 has already started out to be a remarkable year for Squadcast—with our key wins in the G2 Spring Reports, our acquisition by SolarWinds, and a series of impactful product releases and improvements. Our mission has always been clear: to deliver a unified platform that seamlessly integrates On-Call Management and Incident Response, empowering teams to boost service reliability and productivity—all without the burden of context switching.

Comparing ELK, Grafana, and Prometheus for Observability

Monitoring and observability are cornerstones of modern infrastructure management. Three popular solutions that often come up in this space are the ELK Stack, Grafana, and Prometheus. This comparison breaks down the key differences, use cases, and integration capabilities to help you determine which tool or combination better suits your operational needs.

Metrics That Matter: Measuring Developer Productivity in the AI Era

In this episode, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at Ninety.io, for a conversation that cuts through the noise around developer productivity metrics and AI. Mark dives deep into how teams can measure what matters—without falling into the trap of turning every measure into a target. He shares how tools like Developer NPS, DORA metrics, and balanced scorecards can help teams optimize for both output and well-being—but only when framed with the right intent.

Incident management vs. problem management: A practical guide for SREs

In Site Reliability Engineering (SRE), distinguishing incident management from problem management is crucial. While both processes aim to maintain system reliability, they fulfill distinct roles: incident management focuses on quickly resolving immediate disruptions, whereas problem management identifies and rectifies root causes to prevent recurrence. Effectively combining these processes helps minimize downtime, enhances system resilience, and fosters a proactive operational approach.

Java Util Logging Configuration: A Practical Guide for DevOps & SREs

Setting up proper logging is like having a good navigation system when you're driving through unfamiliar territory. For DevOps engineers and SREs managing Java applications, understanding how to configure the built-in java.util.logging framework is essential knowledge that can save you hours of troubleshooting headaches. Let's break down java util logging configuration in a way that makes sense — no fancy jargon, we promise!

How to View and Understand VPC Flow Logs

If you're running workloads in AWS, you've probably heard about VPC Flow Logs. These logs are your eyes and ears for network traffic in your Virtual Private Cloud, and knowing how to check them properly can save you hours of troubleshooting headaches. Whether you're tracking down connectivity issues or monitoring for suspicious activity, this guide will walk you through checking VPC flow logs step by step, with practical examples you can apply today.

Envoy vs HAProxy: Which Proxy Server Is Right for Your Infrastructure?

Choosing between Envoy and HAProxy isn't just about picking a proxy server. It's about deciding which tool will handle your traffic, balance your loads, and keep your services running when everything else wants to crash. If you're a DevOps engineer or system admin weighing these options, you're in the right place.