Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

The Network Impact on Job Completion Time in AI Model Training

In large-scale AI model training, network performance is no longer a supporting actor — it’s center stage. Job Completion Time (JCT), the key metric for measuring training efficiency, is heavily influenced by the network interconnecting thousands of GPUs. In this post, learn why JCT matters, how microbursts and GPU synchronization delays inflate it, and how platforms like Kentik give network engineers the visibility and intelligence they need to keep training jobs on schedule.
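As a rough illustration of why a brief network delay inflates JCT, here is a minimal sketch of a synchronous training job in which each step ends only when the slowest GPU has finished both compute and communication. The function name, the timings, and the single "microburst" delay are all hypothetical and chosen for illustration; they are not taken from the post or from any Kentik tooling.

```python
def job_completion_time(compute_s, comm_s):
    """Sum per-step times for a synchronous job.

    compute_s and comm_s are lists of steps; each step is a list of
    per-GPU durations in seconds (illustrative values only).
    """
    total = 0.0
    for step_compute, step_comm in zip(compute_s, comm_s):
        # A synchronous step completes only when the slowest GPU is done.
        total += max(c + n for c, n in zip(step_compute, step_comm))
    return total


# Hypothetical job: 3 steps on 4 GPUs, 1.0 s compute and 0.2 s communication each.
compute = [[1.0] * 4 for _ in range(3)]
comm = [[0.2] * 4 for _ in range(3)]

# Assumed microburst: one GPU's traffic is delayed by 0.5 s on the second step.
comm_with_burst = [row[:] for row in comm]
comm_with_burst[1][2] += 0.5

baseline = job_completion_time(compute, comm)
with_burst = job_completion_time(compute, comm_with_burst)
print(f"baseline JCT: {baseline:.1f} s, with microburst: {with_burst:.1f} s")
```

In this toy model, a delay that hits a single GPU on a single step is paid by the whole job, because every other GPU waits at the synchronization point.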

From Anomaly to Action: ScienceLogic's Role in Accelerating Zero Trust Response

In today’s threat landscape, cyber incidents unfold in seconds, not days. Federal agencies and critical infrastructure operators no longer have the luxury of slow detection or manual triage. As Zero Trust Architecture (ZTA) becomes the new security standard, one principle stands above all: time is risk. The faster an organization can detect, diagnose, and respond to anomalous activity, the greater its resilience. ScienceLogic plays a critical role in making that speed possible.
Sponsored Post

AIOps for SAP: From Ground to Cloud

Anyone working in the SAP market in 2025 is aware of two big topics: migration to cloud-based ERP and the end of many long-used tools for managing SAP operations including Focused Run, Landscape Manager and Solution Manager. Both are impossible to ignore. Cloud-based ERP presents a new era of business software possibilities, and with it the opportunities and complexities of migration, transformation, and leveraging the elastic capacity and scalability of cloud-based designs. But right behind it, the question becomes "how are we going to run and manage this?"

Data Observability: Build confidence in the data life cycle

Datadog Data Observability provides a complete solution with quality checks (e.g., volume, row changes, freshness), custom SQL-based monitors, anomaly detection, column-level lineage across systems like Snowflake and Tableau, full pipeline visibility, and targeted alerts when data issues arise.
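To make the idea of freshness and volume checks concrete, the following is a generic sketch of what such checks compute. It does not use or represent Datadog's API; the function names, the 6-hour freshness window, and the 20% volume tolerance are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated, now, max_age):
    """Return True if the table was updated within max_age of now."""
    return now - last_updated <= max_age


def volume_ok(row_count, expected, tolerance=0.2):
    """Return True if row_count is within +/- tolerance of the expected count."""
    return abs(row_count - expected) <= tolerance * expected


# Hypothetical observations for a single table.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(now - timedelta(hours=2), now, max_age=timedelta(hours=6))
stale = is_fresh(now - timedelta(hours=9), now, max_age=timedelta(hours=6))
volume = volume_ok(row_count=700, expected=1000)

print(fresh, stale, volume)
```

A real system would typically learn the expected values from history rather than hard-coding them, which is where anomaly detection comes in.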

Disposable Code Is Here to Stay, but Durable Code Is What Runs the World

Every day I seem to run into yet another post with someone solemnly opining that “writing code has never been the hardest part of software engineering.” And hey, that’s smashing. As an engineer from the ops/infra/SRE side of the house, I feel like I’ve been saying this my whole career. (Is there anything more satisfying than being proven right in public? Not in my book.) So, which is it?

Why Your Loki Metrics Are Disappearing (And How to Fix It)

Grafana Loki is up and running, log ingestion looks healthy, and dashboards are rendering without issues. But when you query logs from a few weeks ago, the data's missing. This is a recurring problem for many teams using Loki in production: while the system handles short-term log visibility well, it often lacks the retention guarantees developers expect for historical analysis and incident review.

New in OTel: Auto-Instrument Your Apps with the OTel Injector

As distributed systems scale, maintaining manual instrumentation across services quickly becomes unsustainable. The OTel Injector addresses this by automatically attaching OpenTelemetry instrumentation to applications, with no code changes needed. This blog covers how the OTel Injector works, how it integrates with Linux environments, and how to set it up for consistent telemetry across your stack.