Latest Blogs

How it feels to run an incident with AI SRE

Apr 23, 2026 By Article In Incident.io

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into this over the last 5 years.

How Recurring Instability Turns into Clinical Trial Delays

Apr 23, 2026 By Chanté Frazer In Nexthink

In pharma, reliability becomes an operational priority because research and trial work depend on systems performing consistently across different teams, locations, and conditions. Much of that work sits inside scientific workflows, remote sessions, and compute-heavy environments where behaviour can shift with configuration or load. When that consistency starts to break down, teams keep moving, but time is lost in small increments across the day.

Why Your PromQL Availability Query Returns Nothing When Services Are Healthy

Apr 23, 2026 By Prathamesh Sonpatki In Last9

Your SLI query shows No Data when availability is actually 100%. Here's why PromQL returns empty results instead of zero, and the label-preserving fix. Prathamesh works as an evangelist at Last9, runs SRE Stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of terms related to observability.

The data context gap: why agents fail on fragmented stacks

Apr 23, 2026 By Upsun In Upsun

Key takeaway: AI agents and RAG pipelines only reach production-grade accuracy when they are developed against byte-level clones of real production data. Without environment parity, the "repro gap" leads to inevitable AI failure.

Take Control of Cloud Costs with Proactive Budget Alerts

Apr 23, 2026 By Teia Jensen In LogicMonitor

Proactive budget alerts turn cloud cost optimization into an everyday operational practice. If you are responsible for managing cloud infrastructure, you already know the pattern. Costs creep up quietly, and by the time anyone notices, it is the end of the month and you are explaining instead of preventing overruns. According to Flexera’s 2026 State of the Cloud Report, 85% of their respondents say managing cloud costs is their number one priority for the year.

VictoriaMetrics at KubeCon Amsterdam: Community Highlights

Apr 23, 2026 By Diana Todea In VictoriaMetrics

KubeCon + CloudNativeCon Europe in Amsterdam brought together about 13,500 attendees this year, the largest turnout yet. The size of the event showed just how much the cloud-native space has grown, and how central observability, platform engineering, and cost control have become. For VictoriaMetrics, this year’s event was a mix of talks, booth conversations, and a lot of direct feedback from users.

What's new in VictoriaMetrics Anomaly Detection (Q1 2026)

Apr 23, 2026 By Fred Navruzov In VictoriaMetrics

Following our 2025 updates, here we recap how VictoriaMetrics Anomaly Detection evolved in Q1 2026. Stay tuned for upcoming content on anomaly detection.

Managing OpenTelemetry Semantic Convention Migrations With the Collector

Apr 23, 2026 By Mike Goldsmith In Honeycomb

Real production data tells the story better than I can. Juraci Paixão Kröhling, a friend and fellow observability practitioner at OllyGarden, recently shared an example from an anonymized production environment: 1,830 occurrences of http.url and 23,984 occurrences of url.full in the same dataset. Both attributes describe the same thing. Both are actively being written to the same backend at the same time.

Setting the Bar for Agentic NetOps

Apr 23, 2026 By Steve Stover In Kentik

AI has quickly become part of the language of network observability. Many vendors across the observability landscape can describe, summarize, correlate, or explain some data or situation, leveraging basic LLM capabilities. At a distance, many of these offerings sound similar. They promise faster insight, efficient operations, and a more intelligent path through rising complexity. But the industry has reached a point where surface-level similarity is creating noise, not value.

AI for Incident Response: Should You Build or Buy?

Apr 23, 2026 By Snir Amsalem In Komodor

SREs and platform teams are overwhelmed by the effort of manually troubleshooting ever-more complex cloud-native environments. This pain is driving a breakneck adoption of AI SRE solutions that promise to automate core reliability practices, from root cause analysis to capacity planning. For teams with strong engineering talent, creating a DIY AI SRE seems like a straightforward challenge.
