The latest News and Information on Service Reliability Engineering and related technologies.

Why API Reliability Is Critical to Modern Finance

Jun 17, 2026 By OpsMatters In OpsMatters

Financial APIs power payments, compliance, and customer services. Learn why observability, monitoring, and API reliability are vital to resilience.

ClickHouse LowCardinality: When It Helps and When It Hurts

Jun 15, 2026 By Prathamesh Sonpatki In Last9

ClickHouse LowCardinality cuts storage and speeds up queries on low-cardinality columns, but backfires on trace IDs. Here is how to tell the difference.

Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of terms related to observability.

Introducing the Rootly Agent

Jun 11, 2026 By Rootly In Rootly

During an incident, ask the Rootly Agent anything and it'll respond (and act) based on context and your data. The Rootly Agent performs actions on your behalf, so it is bound by the permissions assigned to your user. It will also ask for confirmation before taking significant actions. Rootly admins can turn it on for their workplaces and start running incidents even more efficiently.

Should platform, SRE, and security merge into one function?

Jun 4, 2026 By Cristina Buenahora In Cortex

Platform, SRE, and security are three distinct functions in modern engineering orgs, each shaped by a different problem. SRE was the operations function's answer to scale: how to keep systems reliable when the systems get big. Platform answered a different problem: how to let developers ship without becoming infrastructure experts. Security drew the line on what could safely reach production.

Best Log Management Software for DevOps and SRE Teams in 2026: Feature and Cost Breakdown

Jun 3, 2026 By Libi Michelson In logz.io

TL;DR: Picking the right log management platform in 2026 comes down to three things: how much operational overhead you can absorb, how much AI automation you need, and what you’re willing to spend.

Running AI at Enterprise Scale

Jun 3, 2026 By Rootly In Rootly

Panel moderated by Ian Sinnott, Member of Technical Staff at Anthropic.

AI in SRE: Where and how Google is deploying agentic AI to improve operations

May 29, 2026 By Stevan Malesevic In Google Operations

With SRE AI, Google plans to fully adopt AI and agentic technologies, leveraging AI as a force multiplier while also maintaining control.

Every pilot is ready for engine failure: are your engineers? w/ Hamed Silatani (Uptime Labs)

May 28, 2026 By Rootly In Rootly

Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident. Hamed Silatani, co-founder and CEO of Uptime Labs, and former Head of Reliability Engineering at IG Group, has spent two decades watching engineers learn incident response the hard way: alone, under pressure, with no training.

AI SRE Agent: How Autonomous Incident Investigation Is Eliminating Manual Root Cause Analysis

May 27, 2026 By Mohana Ayeswariya J In Atatus

A critical production alert wakes you up: p99 latency just hit 4 seconds. You drag yourself to a terminal, open five dashboards, start correlating log timestamps with trace IDs, dig through 47,000 log lines across eight services, and 90 minutes later, you finally find the culprit: an N+1 database query introduced in a deployment that shipped four minutes before the spike started. An Atatus AI SRE Agent would have identified that root cause and drafted a remediation plan in 28 seconds. Not approximation.

Error Budget in SRE: The Complete Guide (2026)

May 20, 2026 By Nuno Tomas In isDown

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response: feature freezes, postmortems, or infrastructure investment. The formula is blunt:

Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration

For a 30-day window: That last number should make you uncomfortable.
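
As a rough illustration of the two formulas, here is a minimal Python sketch that converts an SLO target into allowed downtime for a 30-day window. The function name error_budget_minutes and the SLO targets in the loop are illustrative assumptions, not values taken from the original post.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the time-based error budget in minutes.

    Implements: Error Budget (time) = (1 - SLO Target) x Window Duration.
    slo_target is a fraction (0.999 means 99.9%), not a percentage.
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    budget_fraction = 1.0 - slo_target      # Error Budget = 1 - SLO Target
    window_minutes = window_days * 24 * 60  # a 30-day window is 43,200 minutes
    return budget_fraction * window_minutes


if __name__ == "__main__":
    # Hypothetical SLO targets, chosen only to show how fast the budget shrinks.
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"SLO {target:.2%}: about {minutes:.2f} minutes of budget per 30 days")
```

Each additional nine divides the budget by ten, so a target that looks only slightly stricter on paper leaves far less room for incidents and maintenance.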
