The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

What is Grafana Cloud? Fully Managed Observability Built on Open Standards | Grafana Labs

Grafana Cloud helps teams detect, investigate, and resolve incidents faster—thanks to AI, open standards, and seamless integrations with OpenTelemetry, Prometheus, Salesforce, and more. See how it all works in this live demo of a simulated e-commerce outage.

Building an Effective Post-Mortem Culture: A Step-by-Step Guide

Post-mortems are the cornerstone of continuous improvement in incident management. When done right, they transform failures into learning opportunities and prevent future outages. Yet many teams struggle to build a culture where post-mortems are valued rather than feared.

From Alert to Answer in Seconds: Accelerating Incident Response in Dynatrace

It is 12 PM and you have just started eating lunch when your phone starts buzzing. A storm of monitoring and system-level alerts starts stacking up on your phone and in Slack. The incident response "war room" opens and downtime communications are being drafted for customers. Your team is under pressure to find the root cause, but you immediately hit roadblocks.

Incident IQ integration is here!

We’re excited to launch one of our most highly requested integrations: StatusGator now connects directly with Incident IQ. This powerful new integration bridges the gap between real-time service monitoring and your internal support workflow. Now, whenever someone reports an outage on your public StatusGator page, a ticket is automatically created in Incident IQ—ensuring your IT team can respond quickly and efficiently.

Don't fly blind... monitor from your users' perspective.

Most monitoring strategies focus only on what happens inside the application... but that’s not what your users experience. From your backend to the cloud, through third-party APIs, DNS, CDNs, ISPs, and finally to the user’s device, every link in the chain matters. Without that visibility, you're flying blind when something breaks in your Internet Stack. Catchpoint’s 3,000+ intelligent agents across 100+ countries deliver true end-to-end visibility, capturing every hop, every variable, and every moment of user impact.

Evals are just tests, so why aren't engineers writing them?

You’ve shipped an AI feature. Prompts are tuned, models wired up, everything looks solid in local testing. But in production, things fall apart—responses are inconsistent, quality drops, weird edge cases appear out of nowhere. You set up evals to improve quality and consistency. You use Langfuse, Braintrust, Promptfoo—whatever fits. You start running your evals, tracking regressions, fixing issues, and confidence goes up as a result. Things improve.
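The idea that evals are just tests can be sketched in a few lines of plain Python. This is a minimal illustration, not the workflow of Langfuse, Braintrust, or Promptfoo: the `summarize` function is a hypothetical stand-in for a real model call, and the case fields (`must_include`, `max_words`) are assumed names chosen for the example.

```python
# Hypothetical stand-in for a model call; a real eval would invoke the
# deployed prompt and model here instead.
def summarize(text: str) -> str:
    first_sentence = text.split(".")[0].strip()
    return first_sentence + "."


# Each case pairs an input with properties the output must satisfy.
# The field names are assumptions made for this sketch.
EVAL_CASES = [
    {
        "input": "The deploy failed at noon. Rollback finished ten minutes later.",
        "must_include": "deploy",
        "max_words": 12,
    },
    {
        "input": "Latency doubled after the cache change. Errors stayed flat.",
        "must_include": "Latency",
        "max_words": 12,
    },
]


def run_evals(cases, model=summarize):
    """Return (input, reason) for every case whose output fails a check."""
    failures = []
    for case in cases:
        output = model(case["input"])
        if case["must_include"] not in output:
            failures.append((case["input"], "missing required term"))
        elif len(output.split()) > case["max_words"]:
            failures.append((case["input"], "too long"))
    return failures


def test_summaries_meet_expectations():
    # Written as an ordinary test function, so a test runner such as
    # pytest can collect it and fail the build on a regression.
    assert run_evals(EVAL_CASES) == []
```

Because the eval is an ordinary test function, it can run in the same CI job as the rest of the suite, and a quality regression shows up as a failing build.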

New in APM

Datadog’s Latency Investigator for APM, now in Preview, automatically investigates hypotheses in the background, comparing historical traces and correlating change tracking, DBM, and profiling signals. This helps teams quickly isolate root causes and understand impact without combing through raw telemetry data. You can go from detection to resolution in a single workflow, and generate a pull request to apply a recommended fix, all without leaving Datadog.

AI Agents Console: Monitor the behavior and interactions of any AI agent in your stack

With Datadog's AI Agents Console, you can monitor the behavior and interactions of any AI agent that’s part of your enterprise stack, whether that’s a computer-use agent like OpenAI’s Operator, an IDE agent like Cursor, a DevOps agent like GitHub Copilot, an enterprise business agent like Agentforce, or an internally built agent. You'll have full visibility into every agent's actions, insights into the security and performance of your agents, analytics on user engagement, and measurable business value from every agent, all in a centralized location.