The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

Observing agentic AI workflows with Grafana Cloud, OpenTelemetry, and the OpenAI Agents SDK

As agentic AI applications are used more broadly in production, they introduce new operational models, combining multi-step reasoning, tool execution, and autonomous decision-making into a single workflow. SRE teams need visibility into how these agents behave, where they fail, and how they perform over time.

Monitoring Sprawl: Why IT Teams Still Can't Get Actionable Insight Fast

IT teams collect extensive monitoring data but struggle to turn it into fast, confident decisions during incidents. Most IT leaders aren’t worried about whether their environments are monitored—they’re worried about whether their teams can make sense of what they’re seeing quickly enough to actually resolve issues. When something breaks, the problem usually isn’t finding data. Dashboards show activity, alerts indicate changes, and logs capture events across the entire stack.

AI Agent Governance: How to Keep Agentic ITOps Workflows Safe

The future of ITOps automation is better control over what AI agents can see, share, and do. AI automation in ITOps is expected to resolve incidents, reduce operational load, and operate with limited human involvement. Those outcomes depend on systems that can take action, not just surface insight. Agentic AI enables that shift. AI agents can correlate signals across tools, update tickets, trigger remediation, and coordinate workflows without waiting for instruction.

Make faster, better product decisions with Datadog Product Analytics

Product managers (PMs) need to make fast, confident decisions about what to build, fix, and improve based on user behavior within their application. But in practice, collecting the user insights they require is rarely straightforward. Recent updates to Datadog Product Analytics address this challenge. Product Analytics adds structure to autocaptured data and makes analysis easier to interpret, reuse, and share, helping PMs move from questions to answers without relying on SQL or engineering.

Surface and remediate runtime posture issues with Workload Protection Findings

Threat detection and runtime posture monitoring are related but different jobs. Security teams already rely on Datadog Workload Protection to detect threats in real time across hosts and containers. But the actions that lead to those detections (file manipulation, process execution, network calls, or kernel activity) can be indicative of compromise or simply of risky behavior—like running compilers in production containers.

Alert Noise Isn't an Accident - It's a Design Decision

In a previous post, The Incident Checklist: Reducing Cognitive Load When It Matters Most, we explored how incidents stop being purely technical problems and become human ones. These are moments where decision-making under pressure and cognitive load matter more than perfect root cause analysis. When systems don’t support people clearly in those moments, teams compensate. They add process. They add people. They add noise. Alerting is one of the most visible places where this shows up.

The Grok-to-AI Evolution: Why Modern SREs Are Moving Beyond Manual Parsing

Grok structures logs. Context engineering connects systems. AI explains behavior. For years, Grok patterns have been the workhorse of the SRE world. Built on regular expressions, Grok helps teams extract structure from unstructured logs. As we explored in "Do You Grok It?", Grok is the key to turning messy log lines into usable fields. It's why our Grok Pattern Reference remains one of our most-visited resources — SREs are hungry for structure.
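
As a rough illustration of what "extracting structure from unstructured logs" means, the sketch below uses Python's standard `re` module with named groups, which play a role similar to Grok's `%{PATTERN:field}` syntax. It is not a Grok implementation. The log line, the field names, and the `parse_line` helper are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical access-log line, made up for illustration.
LOG_LINE = '203.0.113.7 - - [12/Mar/2025:10:15:32 +0000] "GET /api/health HTTP/1.1" 200 512'

# Each named group captures one field, much as a Grok pattern names its captures.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = ACCESS_LOG.match(line)
    return match.groupdict() if match else None


fields = parse_line(LOG_LINE)
print(fields)
```

Every extracted value is a string; converting `status` or `bytes` to integers would be a separate step, which Grok-style tooling typically handles with type hints on the pattern.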

ISO 27K Without the Bloat: An Open Source Approach

ISO 27001 certification is often framed as an enterprise-only exercise: long timelines, expensive tooling, consultants everywhere, and a lot of compliance work that exists mainly to survive an audit. As a ~40-person, engineering-driven SaaS company, we needed the same level of trust and rigor as much larger organizations, but we weren’t willing to accept shelfware, parallel compliance infrastructure, or controls that only exist on paper. We also didn’t stop at ISO 27001.