
The latest News and Information on Service Reliability Engineering and related technologies.


How Prometheus Exporters Work With OpenTelemetry

Running distributed systems means you need clear visibility into how your services behave. Prometheus has been the standard for metrics for a long time, and OpenTelemetry is now giving teams a more consistent way to collect telemetry across their stack. In many setups, you'll have both: existing Prometheus instrumentation that's already in place, and new components instrumented with OpenTelemetry.
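Whether a metric originates from a Prometheus client library or from an OpenTelemetry SDK, what a Prometheus server ultimately scrapes is the text exposition format. As a rough sketch of what that wire format looks like (the metric name and labels below are illustrative, not from any real service):

```python
# Minimal sketch of the Prometheus text exposition format that an
# exporter -- or the OTel Collector's Prometheus exporter -- serves
# on /metrics. Metric names and label values here are made up.

def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family in Prometheus exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        # labels is a tuple of (key, value) pairs
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("code", "200")): 1027.0},
)
print(body)
```

Because both ecosystems can speak this format, existing Prometheus scrape jobs keep working even as OpenTelemetry-instrumented components are added alongside them.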

Bits AI SRE, Flex Frozen, and GPU Monitoring | DASH 2025

Get a first look at Datadog’s biggest product reveals from DASH 2025. Meet Bits AI SRE, your 24/7 autonomous AI Site Reliability Engineer, Flex Frozen for up to 7 years of managed log retention, and GPU Monitoring for full visibility into your AI workloads. Experience the future of observability in action.

What Are AI Guardrails

When you're shipping LLM features, a lot of the work goes into keeping the model's behavior predictable: does the output match the structure you expect, does it stay on topic, does it avoid content you can't ship? These are everyday concerns when you integrate LLMs into production systems. Guardrails AI provides a Python framework that helps you enforce those expectations. You define the schema or constraints you need, and the framework validates both the inputs going into the model and the outputs coming back.
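Guardrails AI's actual API is richer than this, but the core loop it automates, validate the model's output against a schema and re-ask on failure, can be sketched with the standard library alone. The schema, the fake model, and the function names below are illustrative stand-ins, not the library's API:

```python
import json

# Hand-rolled sketch of the validate-then-reask loop that frameworks
# like Guardrails AI automate. The schema and fake model are
# illustrative, not real library code.

SCHEMA = {"ticket_id": str, "severity": str}
ALLOWED_SEVERITIES = {"low", "medium", "high"}

def validate(raw: str):
    """Return parsed output if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, expected_type in SCHEMA.items():
        if not isinstance(data.get(field), expected_type):
            return None
    if data["severity"] not in ALLOWED_SEVERITIES:
        return None
    return data

def call_with_guard(model, prompt: str, max_retries: int = 2):
    """Ask the model, re-asking on invalid output, as a guardrail would."""
    for _ in range(max_retries + 1):
        result = validate(model(prompt))
        if result is not None:
            return result
    raise ValueError("model never produced schema-valid output")

# Fake model: fails once with non-JSON, then returns valid output.
responses = iter(["not json", '{"ticket_id": "T-42", "severity": "high"}'])
print(call_with_guard(lambda p: next(responses), "classify this incident"))
```

The point of a framework is that this loop, plus the library of reusable validators, lives outside your application code instead of being re-implemented per feature.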

Pastries with SREs: From AIOps to GenAI and LLMs (lactose-free latte making)

In this episode of Pastries with SREs, we look at AIOps: where it fell short, where it worked, and how generative AI (GenAI) is reshaping what’s possible in observability today. If you’re wondering whether generative AI is different this time, this episode offers a grounded, practical look at how it’s evolving observability workflows.

You Can't Fix What You Don't Measure: Observability in the Age of AI with Conor Bronsdon

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Bronsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to making large language models (LLMs) reliable in production.

Grafana Tempo: Setup, Configuration, and Best Practices

As systems grow, understanding how a request moves across multiple services becomes harder. Traces help bring this picture together by showing the exact path a request takes, along with the timings that matter. Grafana Tempo is built for this kind of workload. It stores traces efficiently, works well with OpenTelemetry, and keeps the operational overhead low.
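Tempo reassembles a trace from spans emitted by different services, which works because each service propagates trace context to the next, typically via the W3C `traceparent` header. As a minimal sketch of that header's shape (the IDs are randomly generated; this is the wire format, not Tempo's API):

```python
import re
import secrets

# Sketch of the W3C `traceparent` header that ties spans from different
# services into one trace, which a backend like Tempo then reassembles.
# Format: version-trace_id-parent_span_id-flags.

def make_traceparent(trace_id: str = "") -> str:
    """Build a traceparent header, reusing trace_id when continuing a trace."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # flags 01 = sampled

def parse_traceparent(header: str) -> dict:
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError(f"malformed traceparent: {header}")
    return {"trace_id": m.group(1), "span_id": m.group(2), "flags": m.group(3)}

# Service A starts a trace; service B continues it with a new span
# under the same trace_id, so the backend can stitch them together.
incoming = make_traceparent()
ctx = parse_traceparent(incoming)
outgoing = make_traceparent(trace_id=ctx["trace_id"])
assert parse_traceparent(outgoing)["trace_id"] == ctx["trace_id"]
```

In practice the OpenTelemetry SDKs handle this propagation for you; the sketch just shows why spans from separate processes can end up in the same Tempo trace.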

SRE vs DevOps vs Platform Engineering: What Are the Key Differences

Software delivery is more complex than ever. Teams need speed, reliability, and scalability to stay competitive. Site Reliability Engineering (SRE), DevOps, and Platform Engineering are three key disciplines that address these challenges. Though these terms are often used together, they are distinct disciplines with different focuses. In this blog, we’ll discuss each term individually, compare SRE vs. DevOps vs. Platform Engineering, and show how they work together.

OTel Updates: Declarative Config - A Steadier Way to Configure OpenTelemetry SDKs

Application configs change over time, often in small ways that are easy to miss. They may start simple — a few environment variables, one exporter, nothing unexpected. As your instrumentation grows, you add rules for filtering health check spans, adjust sampling based on attributes, or introduce environment-specific resource settings. Each change makes sense on its own. But months later, the picture can look different across dev, staging, and production.
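One way to surface the drift described above is to diff the effective SDK configuration across environments. A small sketch of that idea, with made-up keys and values rather than real OpenTelemetry setting names:

```python
# Diff effective config across environments to spot the drift that
# accumulates from small, individually reasonable changes. The keys
# and environments below are made-up examples.

def config_drift(envs: dict) -> dict:
    """Return, for each config key, the per-env values when they disagree."""
    keys = set().union(*(cfg.keys() for cfg in envs.values()))
    drift = {}
    for key in sorted(keys):
        values = {env: cfg.get(key) for env, cfg in envs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift

envs = {
    "dev":        {"sampler_ratio": 1.0, "exporter": "otlp"},
    "staging":    {"sampler_ratio": 0.5, "exporter": "otlp"},
    "production": {"sampler_ratio": 0.1, "exporter": "otlp"},
}
print(config_drift(envs))  # only sampler_ratio disagrees across envs
```

Declarative configuration attacks the same problem from the other direction: with the full SDK setup in one reviewable file per environment, the differences are visible in version control instead of scattered across environment variables.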

Embracing failure and chaos to improve system reliability and SRE team performance

In this interview with Alex Hidalgo, Field CTO at Nobl9 and author of Implementing Service Level Objectives (O’Reilly Media), we explore how traditional metrics like MTTR and MTTx can give a false sense of reliability. Alex shares how SRE teams can embrace failure, build psychological safety, and design systems that reflect the human factor behind uptime, outages, and real-world reliability.