The latest news and information on Site Reliability Engineering and related technologies.

Discover Infrastructure: Kubernetes & Hosts - Launch Week / Day 03

Stop debugging infrastructure issues across multiple dashboards. See how Last9's Discover Infrastructure monitors K8s pods and traditional hosts together, with resource analysis, pod-level debugging, and AI that correlates app problems to infrastructure root causes. One setup (K8s plus host monitoring) gives you complete infrastructure visibility that connects to your services and jobs. No more blind spots between application performance and underlying resources.

Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)

What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner (Razor, ex-Zalando) shares her unique journey from frontend and backend engineering to leading site reliability practices. She explains why protecting the user journey is the key to effective incident management, how SLOs cut through noisy alerts, and why observability must come first.

What is Real User Monitoring

Real User Monitoring (RUM) measures how real users interact with your application in production. Unlike synthetic monitoring, which relies on scripted tests, RUM collects data from actual sessions. This means performance is observed across different devices, networks, and usage patterns. The result is a clear view of how the application behaves under real conditions, where latency is introduced, which features take longer to load, and at what points users drop off.

Your APIs Are Green. Your Background Jobs Are Dying.

Launch Week Day 2: Introducing Discover Jobs. Your dashboard looks perfect. APIs responding in 80ms. Error rates at 0.02%. Kubernetes pods healthy. Everything's green. Then Slack explodes: "Why didn't my invoice generate?" "Where's my password reset email?" "The data export I requested yesterday is still processing?" You check your job queue. Sidekiq dashboard shows 47,000 jobs processed today. Redis looks fine. Workers are running. But somehow, your business logic is silently falling apart.

How to Build a Strategic Roadmap for Site Reliability Engineering Implementation

Getting your site reliability engineering solutions in place can seriously boost how your systems perform. But implementing site reliability engineering (SRE) isn't a simple flip of a switch; it's a process. If you want to keep your systems running smoothly, with minimal downtime and top-notch performance, you need a solid, strategic plan. This roadmap should guide you step-by-step, from setting clear goals to constantly improving your processes.

The Service Discovery Problem Every Developer Knows (But Pretends Doesn't Exist)

Launch Week Day 1: Introducing Discover Services. Picture this: It's 2 AM, alerts are firing, and you're staring at a dashboard trying to figure out which service is causing the cascade of failures. Your service map is a six-month-old Miro board, and you have no idea what's actually talking to what in production right now. If you've been there, you're not alone. In fast-moving teams, new services get deployed faster than you can track them.

Site Reliability Engineering vs DevOps: Which Approach Fits Your Organization?

Choosing between Site Reliability Engineering (SRE) and DevOps can feel like picking between two similar but distinct philosophies. Both aim to improve software delivery and system reliability, but they take different paths to get there. Understanding these differences helps you make an informed decision about which approach aligns best with your organization's goals, culture, and technical needs.