Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Which Observability Tool Helps with Visibility Without Overspend

If you’re trying to control observability spend without cutting visibility, the platforms that usually offer the best cost balance at enterprise scale are Last9, Grafana Cloud, Elastic, and Chronosphere — depending on the shape of your telemetry and the level of operational ownership you want.

OTel Updates: Unroll Processor Now in Collector Contrib

Some log sources bundle multiple events into a single record before shipping them. This is common with VPC flow logs, CloudWatch exports, and certain Windows endpoint collectors. While this batching approach is efficient for transport, it creates challenges when you need to filter, search, or correlate individual events. When a log record contains an array of 47 events, your analytics tool sees one entry instead of 47 distinct records.

Cost Optimization Is Now Part of the SRE Playbook

In the era of cloud-native architectures, Site Reliability Engineering (SRE) has matured from a discipline focused purely on uptime to a sophisticated practice of efficient reliability. The key driver for this evolution is an undeniable truth: cloud spend has become intrinsically linked to system stability.

Datadog Bits AI SRE: Your new teammate for on-call shifts

Bits AI SRE is an always-on SRE agent built to handle complex troubleshooting and late-night alerts. Developed against thousands of real-world incidents and powered by Datadog’s platform, Bits AI SRE analyzes your entire stack, tests hypotheses, and identifies root causes in minutes. Resolve faster, get back to sleep sooner, and give your on-call team the confidence and capacity they need.

9 Monitoring Tools That Deliver AI-Native Anomaly Detection

The observability market has moved beyond manual threshold-setting. Modern platforms use statistical algorithms, machine learning, and causal AI to detect anomalies automatically. Some work immediately after deployment. Others train on your data for better accuracy. Each approach has technical trade-offs worth understanding. This guide compares how nine monitoring solutions handle automated anomaly detection and root cause analysis.

Instrument Jenkins With OpenTelemetry

You can instrument Jenkins with OpenTelemetry using the official plugin and an OpenTelemetry Collector, then send the data to a backend like Last9 to understand where pipeline latency and failures actually originate. Jenkins provides job status and console logs, but it doesn't show how time is distributed across stages, agents, plugins, and external systems. OpenTelemetry fills that gap by emitting traces, metrics, and logs in a standard format that any OTLP-compatible backend can process.