Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

OpenTelemetry Deep Dive: Resilience & High Availability in the OTel Collector

Missed it live? Catch the full recording of OpenTelemetry Deep Dive: Resilience & High Availability in the OTel Collector — a 1-hour workshop on building telemetry pipelines that never drop a signal. We’ll show you why resilience matters, how to design high-availability architectures, and how to configure the OpenTelemetry Collector with retries, batching, and persistent queues. Plus, you’ll see live demos in both Docker and Kubernetes — including scaling Gateway collectors with an HPA — and how Bindplane makes large-scale management seamless.

Reality Bytes #62: Digital Overload - The Distraction Episode

In this episode of Reality Bytes, Sean, Oriana, Dina, Tim, and Tom dive into the ever-present challenge of digital distraction in the workplace. From smartphones and smartwatches to endless Teams notifications, our panelists share their personal "kryptonite" when it comes to staying focused.

How should Prometheus handle OpenTelemetry resource attributes?

Note: A version of this post originally appeared on the OpenTelemetry blog. Victoria Nduka is user experience designer and open source contributor making her way into the cloud native space. She writes about design, accessibility, and open source with the same curiosity she brings to her work. On May 29, I wrapped up my mentorship with Prometheus through the Linux Foundation mentorship program.

The core KPIs of LLM performance (and how to track them)

A few months ago, I built an MCP server for Toronto’s Open Data portal so an agent could fetch datasets relevant to a user’s question. I threw the first version together, skimmed the code, and everything looked fine. Then I asked Claude: “What are all the traffic-related data sources for the city of Toronto?” The tool call fired. I got relevant results. And then I hit an error: “Conversation is too long, please start a new conversation.” I had only asked one question.

Understanding Incident Response vs Incident Remediation

At a high level, incident remediation is a part of the incident response process. An Incident response plan manages the incident lifecycle across planning, detection, investigation, and recovery. Meanwhile, incident remediation focuses on identifying root causes and implementing measures to prevent future occurrences.