The latest News and Information on Service Reliability Engineering and related technologies.

What is Database Monitoring? A Guide for Developers, DevOps, and SREs

Sep 15, 2025 By Pavithra Parthiban In Atatus

Databases handle critical operations for applications, from online banking to e-commerce and streaming services. Any slowdown or failure can directly affect application performance and user experience. Database monitoring tracks performance, detects issues, and helps prevent downtime. It also ensures efficient use of resources, maintains security, and supports compliance requirements.

Background Job Observability Beyond the Queue

Sep 15, 2025 By Anjali Udasi In Last9

Background jobs handle the critical work that happens outside the request path: processing payments, sending emails, generating reports, syncing data. They keep applications running smoothly, but the signals they produce look different from API endpoints. Most teams start with queue metrics—how many jobs are waiting and how quickly they complete. These metrics provide the foundation, but job health extends beyond throughput.

What is Service Catalog Observability and How Does It Work?

Sep 12, 2025 By Faiz Shaikh In Last9

A service catalog gives teams a shared view of their systems—what services exist, who owns them, how dependencies are structured, and the SLAs that guide expectations. It’s an important part of development infrastructure because it helps everyone speak the same language about services. Service catalog observability builds on that foundation.

APM for Kubernetes: Monitor Distributed Applications at Scale

Sep 10, 2025 By Anjali Udasi In Last9

When a payment service runs across 12 pods — each serving different customer segments — and an authentication layer spans three namespaces, performance issues can originate in both the application code and the orchestration layer. The challenge is linking request-level performance data with what’s happening inside the cluster: container CPU limits, pod scheduling decisions, and node-level events.

The End of "Good Code"? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber

Sep 10, 2025 By Rootly In Rootly

Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of *Humans of Reliability*, Rob Zuber, CircleCI CTO, joins Sylvain to explore how coding assistants are reshaping developer workflows and changing what teams value. Rob shares what he’s seeing across CircleCI’s customer base: a clear boost in throughput, new bottlenecks shifting from code creation to code review, and the rise of “vibe coding,” where engineers trust AI-generated code they may not fully understand.

View Video

The Answer to SRE Agent Failures: Context Engineering

Sep 9, 2025 By Mezmo In Mezmo

AI agents for SREs were supposed to slash mean time to resolution and eliminate alert fatigue. Instead, most teams got expensive, unreliable tools that burn through tokens without delivering insights. But what if the problem isn't the AI models themselves? Recent benchmarking reveals the real bottleneck: context engineering. When we tested our context engineering approach against conventional methods, the results were dramatic. Scroll down for the full benchmark comparison.

The Art of Incident Management #sre

Sep 9, 2025 By Rootly In Rootly

Read our post: https://rootly.com/blog/the-art-of-incident-management-part-i

View Video

Connectivity Layer in Agentic AI w/ Alloy Automation #ai

Sep 8, 2025 By Rootly In Rootly

View Video

Kubernetes Monitoring Metrics That Improve Cluster Reliability

Sep 5, 2025 By Anjali Udasi In Last9

A Kubernetes cluster can generate more than 1,400 metrics out of the box. That’s a lot of numbers to sift through, especially when you’re troubleshooting a production slowdown in the middle of the night. The key is knowing which metrics tell you the most, with the least noise. These are the signals worth paying attention to when you need answers fast.
