The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

New Relic's CCU-based pricing is creating unpredictable costs, pushing teams to sample heavily

We talked to 7 companies in August 2025 who were looking to switch from New Relic. One engineering director said they're paying $1,000 a month and only ingesting 10% of their traces. Teams are defaulting to aggressive sampling, some at 1%, others at 10%, to manage costs.

Zooplus Found Faster Root Cause Detection with Elastic Observability

Zooplus Platform Engineering Lead Aram Hakobayan shares how Elastic Observability helps manage 3,000+ microservices and 15,000+ logs/sec across their AWS cloud. Learn how Elastic powers their French market, centralizes monitoring, simplifies root cause analysis, and avoids costly vendor migration. Ideal for DevOps, SREs, and cloud architects scaling fast.

Sentry AI code review, now in beta: break production less

This could’ve been prevented. This should have been prevented. This too. We all hate getting tagged in PRs. The time, the blame when you inevitably miss something, and the constant “I wouldn’t have written it that way” feeling are hard to shake. LLMs promised this would get easier. Promised they would do it for us. But as we’ve seen, we’re not there yet. But this is what Sentry does for a living. We catch bugs… in prod.

Key APM Metrics You Must Track

Application Performance Monitoring (APM) helps you understand how your software runs in production. When you track the right metrics, you see how requests move through your system, where slowdowns happen, and how resources are being used. With this knowledge, you can spot issues early and keep your applications reliable for your users. In this blog, we discuss the key APM metrics to monitor, grouped into categories, and why each one matters for performance and user experience.
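As a rough illustration of what tracking such metrics can look like, here is a minimal Python sketch that summarizes one window of request records into throughput, error rate, and a latency percentile. The blurb does not name the specific metrics, so these three are assumed examples; the `Request` record, its fields, and the `summarize` helper are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize one observation window of requests.

    Treating only 5xx responses as errors is an assumption made here.
    """
    if not requests:
        raise ValueError("no requests in window")
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_ms": percentile(durations, 95),
    }
```

Percentiles are used here rather than an average because a handful of very slow requests can hide behind a healthy-looking mean.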

Memory stall: the agony before OOM

When we set a memory limit for a container, the expectation is simple: if the app leaks memory, the OOM killer steps in, the container dies, Kubernetes restarts it, done. But reality is messier. As a container gets close to its memory limit, allocations don’t just fail instantly. They get slower. The kernel tries to reclaim memory inside the cgroup, and that takes time. Instead of being killed right away, your app just crawls.
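One way to see this slowdown before the OOM killer acts is the kernel's pressure stall information (PSI), which cgroup v2 exposes per cgroup in a `memory.pressure` file. The sketch below parses that file's format. The default path assumes a container with cgroup v2 mounted at `/sys/fs/cgroup`, and the 10% threshold is an arbitrary illustrative value, not a recommendation from the article.

```python
from pathlib import Path


def parse_pressure(text):
    """Parse PSI text such as:

        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
        full avg10=0.00 avg60=0.00 avg300=0.00 total=0

    Returns {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_stalling(pressure, threshold_pct=10.0):
    """True when all tasks were stalled on memory for at least
    threshold_pct percent of the last 10 seconds ("full avg10").
    The threshold is an assumption for illustration."""
    return pressure.get("full", {}).get("avg10", 0.0) >= threshold_pct


def read_memory_pressure(path="/sys/fs/cgroup/memory.pressure"):
    """Read PSI for the current cgroup (path is an assumed default)."""
    return parse_pressure(Path(path).read_text())
```

A liveness check or sidecar could poll `read_memory_pressure()` and alert, or restart the workload, when `is_stalling` stays true, instead of waiting for the OOM kill.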

Building Real-Time Data Pipelines with Kafka, Telegraf, and InfluxDB 3

When milliseconds matter and data never stops flowing, you need a pipeline that can handle high-velocity streaming data with reliability and scale. The modern streaming stack of Kafka, Telegraf, and InfluxDB 3 Core delivers exactly that. To give you a concrete example, this blog works with a fictitious use case: “Papa Giuseppe’s Pizzeria.” Every oven, prep station, and order in this pizza restaurant generates data. Our workflow looks like this.

Beyond Automation: The Rise of Agentic Networks

Agentic AI is the next evolution in network management, moving beyond simple automation to intelligent systems that can reason, plan, and act autonomously. Justin Ryburn, Kentik Field CTO, highlights how this shift automates expertise, enables proactive problem-solving, and empowers human engineers for strategic innovation.