Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Distributed Tracing and related technologies.

How Slack Transformed Their CI With Tracing

Slack experienced meteoric growth between 2017 and 2020—but that level of growth came with growing pains. In his talk at the 2021 o11ycon+hnycon, Frank Chen (LinkedIn), a Slack Senior Staff Engineer, detailed one of Slack’s biggest pain points in that period: flaky tests. A flaky test returns both a passing and failing result despite no changes in the code. At one point, between 2017 and 2020, Slack’s flaky test rate reached as high as 50%.

What's new in Grafana Cloud for July 2021: Traces, live streaming, Kubernetes and Docker integrations, and more

If you’re not already familiar with it, Grafana Cloud is the easiest way to get started observing metrics (Prometheus and Graphite), logs (Grafana Loki), traces (Grafana Tempo), and dashboards. Here are the latest features you should know about!

Distributed Tracing for Kafka Clients with OpenTelemetry and Splunk APM

This blog series is focused on observability into Kafka based applications. In the previous blogs, we discussed the key performance metrics to monitor different Kafka components in "Monitoring Kafka Performance with Splunk" and how to collect performance metrics using OpenTelemetry in "Collecting Kafka Performance Metrics with OpenTelemetry." In this blog, we'll cover how to enable distributed tracing for Kafka clients with OpenTelemetry and Splunk APM.

OpenTelemetry, Not Just for Production Troubleshooting

OpenTelemetry, Not Just for Production Troubleshooting: How to Prevent Downtime as Early as Local Dev OpenTelemetry is a great tool for observability and debugging in production. It provides you with data that empowers understanding of what is slow or broken, as well as what you can do to fix problems that occur in production. But what if you could leverage those same OpenTelemetry capabilities in pre-production? What if you could use those capabilities during development and testing phases to proactively prevent downtime in production?

Conditional Distributed Tracing

Distributed tracing is generally a binary affair—it's off or on. Either a trace is sampled or, according to a flag, it's not. Span placement is also assumed to be an "always-on" system where spans are always added if the trace is active. For general availability and service-level objectives, this is usually good enough. But when we encounter problems, we need more. In this talk, I'll show you how to "turn up the dial" with detailed diagnostic spans and span events that are inserted using dynamic conditions.

Instrumenting Java Applications for Tracing with OpenTelemetry and Jaeger

The aim of this article is to demonstrate how you can instrument a Java application using Opentelementry and Jaeger. In this example, we will be instrumenting our Java application using OpenTelemetry and the OpenTelemetry Java client, and the tracing data will be exported and visualized using Jaeger. We will use the Logz.io Jaeger backend as it is compatible with common tracing standards like Zipkin, OpenTelemetry, and OpenTracing.

Real-time distributed tracing for Go and Java Lambda Functions

Serverless applications streamline development by allowing you to focus on writing and deploying code rather than managing and provisioning infrastructure. To help you monitor the performance of your serverless applications, last year we released distributed tracing for AWS Lambda to provide comprehensive visibility across your serverless applications.

Instrumenting Microservices with Istio for Distributed Tracing

Previously, I wrote a Beginner’s Guide to Jaeger + OpenTracing Instrumentation for Go providing guidance on manually instrumenting Go services. This is useful for cases where we want fine-grained tracing of specific functions. However, what if all we want is to trace a service’s inbound and outbound calls with little to no additional code?

Find the Root Cause Faster with Trace View and Trace Navigator

Like a bratty teenager, traditional monitoring answers your questions, but does so in a terse, unhelpful manner: Why is my page slow? Guess it’s the API call. It’s a 504 thing — you wouldn’t understand. Ok, so why is the API call slow? Ask your DB query. Gosh! You need a better conversation with your code — one which gives you contextual clues about your application’s performance.