The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Top tips: Designing systems people won't work around

Top tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at why people bypass systems—and how better design choices can prevent it. When people work around systems, it’s tempting to blame their behavior. In reality, most employee workarounds are signals.

VirtualMetric's Hybrid Security Data Collection Architecture: Performance and Scale Without Compromise

Modern security operations face a growing architectural challenge: collect telemetry from everywhere, process it in real time, and route it to multiple platforms while maintaining data sovereignty, avoiding agent sprawl, and keeping costs under control. Single-model collection strategies force security teams to make compromises. Agent-only models create operational overhead and maintenance risk. Agentless-only approaches simplify operations but limit depth and flexibility.

Observability with AI? Honeycomb with AI!

Since Honeycomb started, it has had a weakness: too many choices. Every field, custom or standard, hundreds of them, all are free to group, filter, and visualize in dozens of ways. Which ones are interesting? Honeycomb exists to help people understand custom software. It doesn’t pretend to know what matters in your application. That’s an interpretive task, not programmatic. Hey, computers can do interpretation now!

Lightrun Runtime Context MCP

In this video, Lightrun's Moshe Sambol walks you through the power of Lightrun MCP and Runtime Context, a game-changer for AI-assisted development. This integration lets developers debug live issues, inspect real-world variables, and verify fixes across environments, all without leaving the IDE. With Lightrun MCP, you can capture live transaction state directly from Staging and Production, identify root causes using real runtime values rather than static code alone, and verify fixes instantly without redeploying or context switching.

High Cardinality Metrics: How Prometheus and ClickHouse Handle Scale

TL;DR: Prometheus pays cardinality costs at write time (memory, index). ClickHouse pays at query time (aggregation memory). Neither is "better": they fail differently. Design your pipeline knowing which failure mode you're accepting. Every month, someone posts "just use ClickHouse for metrics" or "Prometheus can't handle scale." Both statements contain a kernel of truth wrapped in dangerous oversimplification.
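
To make the write-time cost concrete: in Prometheus, every unique combination of label values is stored as its own time series, so the worst-case series count for a metric is the product of each label's distinct values. The sketch below is only an illustration of that arithmetic; the label names and counts are hypothetical and are not taken from the article.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct series for one metric.

    Each unique combination of label values is a separate series, so the
    worst case is the product of the distinct-value counts per label.
    """
    return prod(label_cardinalities.values())


# Hypothetical labels on a single request-counter metric.
labels = {"method": 5, "status": 8, "instance": 50}
base = max_series(labels)

# Adding one unbounded label (for example a user ID) multiplies the total.
with_user = max_series({**labels, "user_id": 10_000})

print(f"without user_id: {base:,} series")
print(f"with user_id:    {with_user:,} series")
```

In practice the real count is usually lower than this bound because not every combination occurs, but the multiplication shows why a single high-cardinality label dominates the cost.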

Most Popular Java Web Frameworks in 2026

Look, if you're starting a new Java web project in 2026, you should probably just use Spring Boot. With 14.7% usage in the 2025 Stack Overflow Developer Survey and a 53.7% admiration score among all web frameworks, it remains the default choice for modern Java web development. It has the largest ecosystem, best documentation, most active community, and strongest cloud-native support—now enhanced with built-in AI capabilities through Spring AI.

Major outage takes down X and Grok

On January 16, 2026, the social media platform X (formerly known as Twitter) and its AI chatbot, Grok, experienced a widespread outage affecting users around the world. This incident underscores why proactive outage detection matters. StatusGator’s Early Warning Signals spotted meaningful signs of disruption long before any official provider acknowledgment appeared publicly and helped organizations prepare or respond faster than waiting for status pages or press releases.

Verizon outage - January 14

When a major carrier like Verizon goes down, the impact is immediate and widespread. On January 14, 2026, thousands of users across the United States found themselves without cellular service, unable to make calls, send texts, or access data. While social media erupted with reports of “SOS mode” on iPhones, official acknowledgment from the provider lagged behind for hours.

New API endpoints: Pause and resume website & ping monitors

We’ve added new API capabilities that give you more control over your monitoring workflows – directly from code. You can now pause and resume website and ping monitors via the StatusGator API, exposing the same pause functionality that’s available in the UI.