The latest news and information on Service Reliability Engineering and related technologies.

Docker Status Unhealthy: What It Means and How to Fix It

Jul 4, 2025 By Faiz Shaikh In Last9

If your container shows Status: unhealthy, Docker's health check is failing. The container is still running, but something inside, usually your app, isn’t responding as expected. This doesn’t always mean a crash. It just means Docker can’t verify the app is working. Here’s how to debug the issue and restore the container to a healthy state.
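
As a rough illustration of what such a health check does, the sketch below probes an HTTP endpoint and maps the result to the exit codes Docker interprets (0 for healthy, 1 for unhealthy). The function name `probe` and the URLs are hypothetical and not taken from the post; this is a minimal sketch assuming the app exposes an HTTP health endpoint.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return 0 if the URL answers with a 2xx status, otherwise 1.

    The 0/1 values mirror the exit codes Docker expects from a
    HEALTHCHECK command: 0 means healthy, 1 means unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a malformed
        # URL all count as "Docker cannot verify the app is working".
        return 1


# Hypothetical endpoint; nothing is assumed to be listening on it here.
print(probe("http://127.0.0.1:1/health", timeout=0.5))
```

A script built on this would call `sys.exit(probe(url))` at its entry point and be wired into the image with a `HEALTHCHECK CMD` instruction. Note that the container keeps running either way; only the reported status changes.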

LangChain Observability: From Zero to Production in 10 Minutes

Jul 3, 2025 By Anjali Udasi In Last9

LangChain apps are powerful, but they’re not easy to monitor. A single request might pass through an LLM, a vector store, external APIs, and a custom chain of tools. And when something slows down or silently fails, debugging is often guesswork. In one instance, a developer ended up with an unexpected $30,000 OpenAI bill, with no visibility into what triggered it. This blog shows how to avoid that using OpenTelemetry and LangSmith. With this setup, you’ll be able to ...

Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)

Jul 3, 2025 By Rootly In Rootly

Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the intersection of traditional finance and crypto markets. Brian shares how unexpected market events can create massive traffic spikes, how their platform architecture and Kubernetes setup help them stay resilient, and why Uphold's transparency and regulatory approach make them both trustworthy and a high-profile target.

LangChain & LangGraph: The Frameworks Powering Production AI Agents

Jul 2, 2025 By Anjali Udasi In Last9

Your AI agent worked flawlessly in development, with fast responses, clean tool use, and nothing out of place. Then it hit production. A simple "What's our pricing?" query triggered six API calls, took 8 seconds, and returned the wrong answer. No errors. No stack traces. Unlike traditional systems, AI agents don't crash; they drift. They make poor decisions quietly, and your monitoring says everything's fine.

How to Run Elasticsearch on Kubernetes

Jul 2, 2025 By Anjali Udasi In Last9

Elasticsearch stands as one of the most robust open-source search engines available today. Built on Apache Lucene, it handles complex search operations, real-time analytics, and large-scale data processing with impressive speed and accuracy. Kubernetes has transformed how we deploy and manage containerized applications. This orchestration platform automates deployment, scaling, and operations of application containers across clusters of hosts.

Logging in Docker Swarm: Visibility Across Distributed Services

Jul 1, 2025 By Faiz Shaikh In Last9

Docker Swarm's logging model shifts from individual container logs to service-level aggregation. The docker service logs command batch-retrieves logs present at the time of execution, pulling data from all containers that belong to a service across your cluster. This approach gives you a unified view of distributed applications, but it comes with its own patterns and considerations for effective observability.

How to Write Logs to a File in Go

Jul 1, 2025 By Anjali Udasi In Last9

When your Go application moves beyond development, you need structured logging that persists. Writing logs to files gives you the control and reliability that stdout can't match, especially when you're debugging production issues or need to meet compliance requirements. This blog walks through the practical approaches, from Go's standard library to structured logging with popular packages.

Prometheus Gauges vs Counters: What to Use and When

Jun 30, 2025 By Anjali Udasi In Last9

Choosing the wrong metric type in Prometheus can lead to inaccurate dashboards, false positives in alerting, and missed indicators of system failure. Gauge metrics are intended for tracking values that can go up and down, such as memory usage, queue depth, or the number of active connections. Unlike counters, which only increment (or reset on restart), gauges reflect the current state of a resource at scrape time.
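
The difference can be sketched in plain Python, without any Prometheus client library. The helper names `counter_increase` and `gauge_current` and the sample values are made up for illustration; the reset handling is a simplified version of how counter queries treat a drop in value and is not the post's own code.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across successive scrapes of a counter.

    A counter only goes up, so a drop between two scrapes is treated
    as a restart: the counter began again from zero, and the new value
    is counted as the increase since the reset.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


def gauge_current(samples: Sequence[float]) -> float:
    """A gauge is read as-is: the latest scrape is the current state."""
    return samples[-1]


# Hypothetical scrapes: the request counter restarts after the third sample.
requests_total = [10, 15, 20, 3, 8]
# Hypothetical scrapes of a queue depth that rises and falls.
queue_depth = [4, 9, 2]

print(counter_increase(requests_total))  # 5 + 5 + 3 + 5 = 18.0
print(gauge_current(queue_depth))        # 2
```

Applying the counter logic to the queue depth would misread every decrease as a restart, which is one way a wrong metric type ends up producing misleading dashboards.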

Monitoring Behind the Great Firewall

Jun 27, 2025 By Dotcom-Monitor In Dotcom-Monitor

As Site Reliability Engineers (SREs) managing global infrastructure, we face unique challenges when serving users in mainland China. The Great Firewall of China creates a complex web of technical obstacles that can render even the most robust international websites slow, unreliable, or completely inaccessible to Chinese users.

Prometheus and CloudWatch Integration for AWS Metric Collection

Jun 26, 2025 By Anjali Udasi In Last9

The Prometheus CloudWatch exporter pulls AWS CloudWatch metrics into your Prometheus setup, giving you a unified view of your infrastructure alongside application metrics. If you're already running Prometheus and need visibility into AWS services like EC2, RDS, or Lambda, this exporter handles the integration without forcing you to switch monitoring stacks.
