Latest Posts

The Incident Retrospective Ground Rules

I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was especially excited about was the opportunity to see Honeycomb’s incident retrospective process from the inside. I wasn’t disappointed! The first retrospective I took part in was for our ingestion delays incident on September 8th.

Scaling Ingest With Ingest Telemetry

With the introduction of Environments & Services, we’ve seen a dramatic increase in the creation of new datasets. These new datasets are smaller than ones created with Honeycomb Classic, where customers would typically place all of their services under a single, large dataset. This change has presented some interesting scaling challenges, which I’ll detail in this post, along with the solution we used, and how we leveraged Honeycomb’s own telemetry to scale Honeycomb.

Customer Story: Intercom Reduces MTTWTF With Observability and Distributed Tracing

Intercom’s mission is to build better communication between businesses and their customers. With that in mind, they began their journey away from metrics alone and towards complete observability. The first step was tooling, and they learned quickly that trying to work with multiple solutions was not the answer.

Announcing New CircleCI + Honeycomb Integration Guide

If you’re writing software today, then you likely use a CI/CD pipeline to build and test your code before deploying it to production. Having a fast and efficient build pipeline saves you development time, shortens feedback loops, and helps you ship features faster. Conversely, slow and unreliable build pipelines are full of lost productivity and sadness.

Touching Grass With SLOs

One of the things that struck me upon joining Honeycomb was the seemingly laissez-faire approach we took towards internal SLOs. From my own research (beginning with the classic SRE book and following Google’s example), I came to a set of conclusions. If you read the original SRE book when it was released, before the workbook came out, these conclusions all made sense.

Monitoring Cloud Database Costs with OpenTelemetry and Honeycomb

In the last few years, the usage of databases that charge by request, query, or insert, rather than by provisioned compute infrastructure (e.g., CPU and RAM), has grown significantly. They’re popular for many of the same reasons that serverless compute functions are: the cost scales with your usage. No one is using your site? No problem: you’re not charged.

On Building a Platform Team

It may surprise you to hear, but Honeycomb doesn’t currently have a platform team. We have a platform org, and my title is Director of Platform Engineering. We have engineers doing platform work. We even have an SRE team and a core services team. But a platform team? Nope. I’ve been thinking about what it might mean to build a platform team up from scratch (a situation some of you may also be in), and it led me to ask some crucial questions. What should such a team own?

5-Star OTel: OpenTelemetry Best Practices

Written by Liz Fong-Jones and Phillip Carter. OpenTelemetry, also known as OTel, is a CNCF open standard that enables distributed tracing and metrics collection from your applications. At Honeycomb, we believe that OpenTelemetry is the best way to ingest the high-cardinality and high-dimensional data that every system, no matter how complex or distributed, needs for observability.

New Honeycomb Features Raise the Bar for What Observability Should Do for You

As long as humans have written software, we’ve needed to understand why our expectations (the logic we thought we wrote) don’t match reality (the logic being executed). To that end, we developed techniques to help measure reality—logging text strings, or capturing aggregated metrics—and persevered, seeking out newer and fancier logging or monitoring solutions over the intervening decades.