The latest news and information on Site Reliability Engineering and related technologies.

How it feels to run an incident with AI SRE

Apr 23, 2026 By Article In Incident.io

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into this over the last 5 years.

Why Your PromQL Availability Query Returns Nothing When Services Are Healthy

Apr 23, 2026 By Prathamesh Sonpatki In Last9

Your SLI query shows 100% availability as No Data. Here's why PromQL returns empty results instead of zero, and the label-preserving fix. Prathamesh works as an evangelist at Last9, runs SRE Stories, where SRE and DevOps folks share their experiences, and maintains o11y.wiki, a glossary of observability terms.

Instrumenting WordPress with OpenTelemetry: PHP Tracing, Browser RUM, and Error Capture in Production

Apr 21, 2026 By Prathamesh Sonpatki In Last9

WordPress powers 40% of the web but has no native observability story. Here's how to instrument it end-to-end with OpenTelemetry - PHP, browser RUM, and errors.

10,000 GPUs, One TSDB: Cardinality at GPU Scale

Apr 21, 2026 By Shekhar In Last9

1,000 nodes × 8 GPUs × 60 metrics = 1.4M time series - before you add pod names or Slurm job IDs. GPU monitoring is a cardinality problem disguised as a metrics problem. How to design for it before production OOMs your Prometheus.
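
The arithmetic in a teaser like this is easy to check with a quick sketch. The function name `series_count` and the `extra_label_values` parameter are illustrative assumptions, not taken from the post. The three quoted factors multiply to 480,000 on their own, so the 1.4M figure would need a further multiplier of about 3 that the teaser does not state.

```python
def series_count(nodes, gpus_per_node, metrics_per_gpu, extra_label_values=1):
    """Upper bound on active time series for a GPU fleet.

    Each extra label (pod name, job ID, ...) multiplies the count by the
    number of distinct values it takes per GPU metric. `extra_label_values`
    is a hypothetical knob standing in for that multiplier.
    """
    return nodes * gpus_per_node * metrics_per_gpu * extra_label_values


# The three factors quoted in the teaser.
base = series_count(nodes=1_000, gpus_per_node=8, metrics_per_gpu=60)
print(f"{base:,}")  # 480,000

# Assumed: three distinct label values per series over the retention window.
with_labels = series_count(1_000, 8, 60, extra_label_values=3)
print(f"{with_labels:,}")  # 1,440,000
```

The point of the sketch is that cardinality grows multiplicatively: every label added to a per-GPU metric scales the whole fleet's series count, not just one node's.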

From GPU Silicon to Business Metrics: The 8 Layers of GPU Observability

Apr 21, 2026 By Shekhar In Last9

GPU observability isn't one thing - it's eight connected layers from silicon to cost. See why correlation across layers is what cuts debugging from 2 hours to 2 minutes, and why most teams instrument only one or two.

How to solve key site reliability engineering challenges

Apr 20, 2026 By Lightrun Team In Lightrun

Modern site reliability engineering challenges stem from the difficulty of confirming why complex systems fail in ways staging cannot replicate. Observability tools signal failures and AI SREs reason over data, but both leave gaps around the actual state of running code. By using runtime context, teams capture live execution data to accelerate production debugging, resolving incidents in minutes without manual redeploy cycles.

The GPU Metrics That Actually Matter

Apr 20, 2026 By Shekhar In Last9

Most teams monitor three GPU metrics - utilization, temperature, memory. There are 50+ that matter, and the ones you skip cause your worst outages. A vendor-neutral guide across NVIDIA, AMD, and Intel Gaudi.

Your LLM Is Slower Than You Think

Apr 19, 2026 By Shekhar In Last9

60% GPU utilization and 3-second response times? GPU utilization is the wrong signal for LLM inference. Here's why TTFT, KV-cache pressure, and queue depth - not utilization - predict user-facing latency.

Predicting GPU Failures Before They Cost You

Apr 18, 2026 By Shekhar In Last9

Predict GPU hardware failures 48–72 hours in advance. A guide to the five rate-based signals — ECC error trends, XID events, thermal ramp, row remap exhaustion, PCIe downtraining — and how to combine them into a composite health score.

Every Token Has a Price: Per-Request GPU Cost Attribution

Apr 17, 2026 By Shekhar In Last9

Flat per-token pricing is wrong by 10–50× per request. Prefill vs decode, batch sharing, and cache effects break the math. How to attribute real GPU cost - compute, energy, and dollars - to each inference request.
