The latest news and information on Service Reliability Engineering and related technologies.

Risk Register for SREs: A Practical Guide to Proactive Incident Prevention

Jul 18, 2025 By Nuno Tomas In isDown

A risk register is one of the most powerful tools in an SRE's arsenal for maintaining system reliability. By systematically documenting potential threats to your infrastructure and services, you can shift from reactive firefighting to proactive risk management.

Set Up ClickHouse with Docker Compose

Jul 18, 2025 By Preeti Dewani In Last9

ClickHouse is built for high-performance OLAP workloads, capable of scanning billions of rows in seconds. If your analytical queries are bottlenecked on PostgreSQL or MySQL, or you're burning too much on Elasticsearch infrastructure, ClickHouse offers a faster and more cost-efficient alternative. This blog walks through setting up ClickHouse locally with Docker Compose and scaling toward a production-grade cluster with monitoring in place.

Stream AWS Metrics to Grafana with Last9 in 10 minutes

Jul 18, 2025 By Faiz Shaikh In Last9

It’s 2:47 AM and your Lambda functions are timing out. API response times are spiking. You’re flipping between the CloudWatch console, your APM tool, and your logs, trying to figure out what’s going wrong. CloudWatch has the metrics you need (CPU usage, memory pressure, and request rates), but connecting that data to what your app is doing takes time. The delay in stitching it all together slows down your incident response.

Query and Analyze Logs Visually, Without Writing LogQL

Jul 17, 2025 By Anjali Udasi In Last9

It’s 2 AM. An incident’s in progress. Error rates are climbing. You jump into the logs, filter by service, adjust the time window… and now you need a LogQL query. You write one. It errors out. You fix the syntax, try again, only to realize you need a different filter or a new aggregation. Back to rewriting. By the time you’ve got the query right, you’ve already lost 10–15 minutes. The system is still broken, and you still don’t know why.

Trace Go Apps Using Runtime Tracing and OpenTelemetry

Jul 17, 2025 By Preeti Dewani In Last9

When your Go service hits 500ms latencies but CPU usage is flat, tracing gives you visibility into what the profiler misses. With 1–2% runtime overhead, Go’s built-in tracing tools make it easier to debug performance regressions that don’t leave a clear footprint.

Build Log Automation with Last9's Query API

Jul 16, 2025 By Prathamesh Sonpatki In Last9

Manual log investigation is one of those engineering tasks that quietly drains hours without offering much real value. You're debugging an incident. Monitoring shows elevated error rates. Now begins the familiar drill. It’s a tedious cycle, and it doesn’t scale. The whole process breaks down when you’re trying to automate incident response, run continuous security monitoring, or generate compliance reports.

Kibana Logs: Advanced Query Patterns and Visualization Techniques

Jul 16, 2025 By Anjali Udasi In Last9

Kibana gives you a structured way to explore log data indexed in Elasticsearch. With the right queries and visualizations, you can identify anomalies, debug issues more quickly, and track trends across services. This blog covers practical ways to query logs using Kibana’s Lucene and KQL syntax, build visualizations that surface meaningful signals, and set up dashboards for ongoing log-based monitoring.

Enable Kong Gateway Tracing in 5 Minutes

Jul 16, 2025 By Anjali Udasi In Last9

Kong Gateway is a popular API gateway that sits at the edge of your infrastructure, routing and shaping traffic across microservices. It’s fast, pluggable, and battle-tested, but for many teams, it remains a black box. You might have OpenTelemetry set up across your application stack. Traces flow from your app servers, databases, and third-party APIs. But the moment a request enters through Kong, observability drops off.

Jaeger Metrics: Internal Operations and Service Performance Monitoring

Jul 15, 2025 By Faiz Shaikh In Last9

You're monitoring a microservices-based system. Alerts trigger when response times exceed 2 seconds. But when you open Jaeger, you're faced with thousands of traces. Identifying which service or operation is responsible becomes time-consuming. Jaeger metrics help reduce this friction by exposing aggregated telemetry. Instead of scanning individual traces, you get service-level and operation-level performance metrics (latency, throughput, and error rates) that highlight where the issue lies.

How to Get Grafana Iframe Embedding Right

Jul 14, 2025 By Anjali Udasi In Last9

Adding Grafana dashboards directly into your app lets users see monitoring data without switching tabs or tools. Using an iframe to embed Grafana does work, but it brings along some tricky authentication and security issues that aren’t always obvious at first. In this blog, we’ll go over the practical ways to embed Grafana dashboards, from easy public snapshots to secure, private dashboards that need authentication.
