The latest news and information on API development, management, monitoring, and related technologies.

Is OpenTelemetry overkill? There's a lazier (and better) way.

If you "aspire to be lazy" like we do, you know that building staging environments and mocking complex back-ends (like MySQL, AI models, and 3rd party APIs) is a massive time sink. In this demo, we show you how to use Internet Magic (aka eBPF) to: Stay tuned for Part 2, where we take these recordings and spin up a staging environment automatically.

API Latency Monitoring: Metrics, Percentiles, and Alerting Best Practices

APIs power modern applications. Every login request, product search, payment authorization, and mobile app refresh depends on an API responding quickly and reliably. When latency increases, users feel it immediately. Pages stall. Transactions hang. Confidence drops. Most engineering teams measure API latency. Fewer truly monitor it. There is a difference. Many teams track average latency in dashboards and assume performance is healthy, but averages hide the slow tail that users actually feel.
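To make that concrete, here is a minimal sketch, with made-up sample latencies, of how percentiles expose a slow tail that the average smooths over (the helper uses the nearest-rank method):

```typescript
// Illustrative only: why an average hides tail latency.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank method: index of the p-th percentile in the sorted list.
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// 95 fast requests and 5 slow ones -- fabricated numbers.
const latenciesMs = [
  ...Array.from({ length: 95 }, () => 40),
  ...Array.from({ length: 5 }, () => 2000),
];

const avg = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;
console.log(`avg=${avg.toFixed(0)}ms`);              // avg=138ms -- looks "healthy"
console.log(`p50=${percentile(latenciesMs, 50)}ms`); // p50=40ms
console.log(`p99=${percentile(latenciesMs, 99)}ms`); // p99=2000ms -- 1 in 100 users waits 2s
```

The average says 138ms; the p99 says one in a hundred users waits two full seconds. Alerting on percentiles catches what alerting on the mean never will.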

API Endpoint Monitoring: How to Ensure Reliability, Performance & Functional Accuracy

APIs sit at the core of modern digital infrastructure. From e-commerce checkouts and payment processing to SaaS platforms and mobile applications, APIs move the data that keeps systems running. But APIs do not operate as a single unit. They are made up of individual endpoints, and each endpoint represents a specific function or resource that users depend on. As organizations shift toward microservices, cloud-native applications, and third-party integrations, the number of endpoints increases rapidly.
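As a sketch of what per-endpoint monitoring can look like, here is a minimal check loop covering all three concerns in the title: availability, latency, and functional accuracy. The endpoint URL, latency threshold, and validation rule are illustrative assumptions, not any particular vendor's implementation (requires Node 18+ for the global fetch):

```typescript
interface EndpointCheck {
  url: string;
  maxLatencyMs: number;
  validate: (body: unknown) => boolean; // functional accuracy, not just "200 OK"
}

// Hypothetical endpoint and thresholds.
const checks: EndpointCheck[] = [
  {
    url: "https://api.example.com/v1/orders/123",
    maxLatencyMs: 500,
    validate: (body) => typeof (body as any)?.id === "string",
  },
];

async function runCheck(check: EndpointCheck): Promise<void> {
  const start = performance.now();
  const res = await fetch(check.url);
  const latencyMs = performance.now() - start;
  const body = await res.json();

  // Each endpoint is checked on its own: reachable, fast, and correct.
  if (!res.ok) throw new Error(`${check.url}: HTTP ${res.status}`);
  if (latencyMs > check.maxLatencyMs)
    throw new Error(`${check.url}: ${latencyMs.toFixed(0)}ms > ${check.maxLatencyMs}ms`);
  if (!check.validate(body))
    throw new Error(`${check.url}: response failed functional validation`);
}

checks.forEach((c) => runCheck(c).catch((e) => console.error(e.message)));
```

The point of the `validate` hook is that an endpoint can return 200 quickly and still be wrong; functional checks catch that class of failure.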

AI Coding Agents Break What Works

Your AI coding agent just made every test pass. Ship it, right? Not so fast. A growing class of AI-generated bugs doesn’t come from writing bad code. It comes from the AI changing working code to accommodate its own mistakes. This isn’t a theoretical risk. It’s happening now, in production codebases, and it’s harder to catch than any bug the AI might introduce from scratch.
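As a toy illustration of this failure class (not an example from the article), consider an agent that calls an existing function incorrectly and then "fixes" the function instead of its own call site:

```typescript
// Existing, correct behavior: price is stored in cents.
function formatPrice(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

// Agent-written caller passes dollars by mistake...
const display = formatPrice(19.99);
console.log(display); // "$0.20" -- the agent's new test fails

// ...and instead of fixing its call site, the agent rewrites formatPrice
// to accept dollars. Its test now passes, and every existing caller that
// correctly passes cents is silently broken.
```

The tests are green either way; only one of the two fixes preserves the contract the rest of the codebase depends on.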

The 4 Golden Signals of Monitoring Explained

As a team, we have spent many years troubleshooting performance problems in production systems. Applications have become so complex that you need a standard methodology to understand performance. Our approach to this problem is called the Golden Signals: latency, traffic, errors, and saturation. By measuring these four key metrics and paying close attention to them, providers can simplify even the most complex systems into an understandable corpus of services and systems.
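As a sketch of what instrumenting the four signals can look like, here is a minimal example using the prom-client package; the metric names, buckets, and handler shape are illustrative assumptions, not a prescribed setup:

```typescript
import client from "prom-client";

// 1. Latency: distribution of request durations, not just the mean.
const latency = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Request latency",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// 2. Traffic: total requests (derive a rate() from this counter in queries).
const traffic = new client.Counter({
  name: "http_requests_total",
  help: "Total requests",
});

// 3. Errors: failed requests, labeled by status code.
const errors = new client.Counter({
  name: "http_request_errors_total",
  help: "Failed requests",
  labelNames: ["status"],
});

// 4. Saturation: how "full" the service is, e.g. queue depth or pool usage.
const saturation = new client.Gauge({
  name: "worker_queue_depth",
  help: "Jobs waiting in the queue",
});

// Shape of the instrumentation inside a request handler:
function onRequest(durationSeconds: number, status: number, queueDepth: number) {
  traffic.inc();
  latency.observe(durationSeconds);
  if (status >= 500) errors.inc({ status: String(status) });
  saturation.set(queueDepth);
}
```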

Enhancing our API for better agentic consumption

AI coding agents like Claude Code and Codex are becoming a real part of developer workflows. They don't just write code; they call APIs, interpret responses, and take action based on what they find. That means the quality of your API responses directly affects how useful an agent can be. We've shipped a series of improvements to the Oh Dear API with this in mind. Every change helps humans too, but we specifically optimized for how agents consume and reason about data.
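The specific changes aren't enumerated in this excerpt, so the snippet below shows only the general pattern, with hypothetical field names: a response that says what went wrong and what to do next is one an agent can act on without guessing:

```typescript
// An agent hitting this can only retry blindly.
const terse = { error: true };

// A self-describing error gives the agent something to reason about.
const agentFriendly = {
  error: {
    code: "monitor_limit_reached",
    message: "This account already has 50 of 50 monitors.",
    // Tell the caller what to do next instead of making it infer.
    resolution: "Delete an existing monitor or upgrade the plan.",
    docs: "https://example.com/docs/errors#monitor_limit_reached",
  },
};

console.log(JSON.stringify(terse), JSON.stringify(agentFriendly, null, 2));
```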

The Observability Gap: Why Monitoring Data Should Drive Tests

Most teams already know a lot about production. They have dashboards. They have traces. They have alerts. They have enough telemetry to explain what happened after an incident and enough graphs to argue about it for the rest of the week. Then they go to test a change and start from scratch. The integration tests hit a hand-written mock that returns {"status": "ok"}. The load tests replay a CSV somebody exported months ago. Staging is close enough to production right up until it matters.
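One way to close that gap, sketched below with a hypothetical recording format, is to serve recorded production responses as the test mock instead of a hand-written stub:

```typescript
import { createServer } from "node:http";

type Recording = { method: string; path: string; status: number; body: unknown };

// Imagine these were exported from your observability pipeline --
// real responses, real status codes, real edge cases.
const recordings: Recording[] = [
  { method: "GET", path: "/v1/users/42", status: 200, body: { id: "42", plan: "pro" } },
  { method: "GET", path: "/v1/users/43", status: 404, body: { error: "not_found" } },
];

createServer((req, res) => {
  const hit = recordings.find((r) => r.method === req.method && r.path === req.url);
  res.writeHead(hit?.status ?? 501, { "Content-Type": "application/json" });
  // Unrecorded routes fail loudly instead of returning {"status": "ok"}.
  res.end(JSON.stringify(hit?.body ?? { error: "no recording for this route" }));
}).listen(8080, () => console.log("replay mock on :8080"));
```

The difference from the hand-written mock is that the fixture drifts with production instead of with somebody's memory of production.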

Automate Your Monitoring and Incident Handling: How Agents Dominate the Checkly CLI

50% of Checkly's CLI users are already coding agents. We predict that agents will become dominant by the end of 2026. This video demonstrates an agentic workflow: an alert reports a broken Shopify store login flow, and Claude Code, using the installed Checkly Skill and the Checkly CLI, pulls monitoring results, identifies a failing Playwright test, investigates the codebase, finds and fixes the bug, and then creates an incident on a Checkly status page.
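The actual check from the video isn't shown here, so the following is only the general shape of a Playwright login-flow test of the kind Checkly runs as a browser check; the URL and selectors are hypothetical:

```typescript
import { test, expect } from "@playwright/test";

test("customer can log in", async ({ page }) => {
  // Hypothetical store URL and form labels.
  await page.goto("https://example-store.myshopify.com/account/login");
  await page.getByLabel("Email").fill("monitor@example.com");
  await page.getByLabel("Password").fill(process.env.LOGIN_PASSWORD!);
  await page.getByRole("button", { name: "Sign in" }).click();
  // When the login flow breaks, this assertion fails and fires the alert
  // the agent then investigates.
  await expect(page).toHaveURL(/\/account/);
});
```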

Checkly and the Agentic Software Layer

The Opus 4.5 release on November 24th was a turning point for the entire tech industry. This was the moment when agents became capable. Capable enough to write solid staff-level code. Capable enough to reason about alerts, investigate root causes much faster than most engineers, and set up the reliability layer in a fraction of the time. For me, this feels like an iPhone moment on steroids; the adoption of AI is accelerating faster than any adoption curve I’ve seen over the past few decades.