Operations | Monitoring | ITSM | DevOps | Cloud

Incident Response Reimagined: Accelerating Resolution with AI Agents

Learn how PagerDuty is leveraging Agentic AI to transform the incident lifecycle from reactive firefighting to proactive prevention. Manuel Reis, Software Developer at PagerDuty, demonstrates how new tools like the SRE Agent and Scribe Agent assist engineers during high-pressure outages by autonomously triaging alerts, querying logs in tools like Grafana, and transcribing context directly into incident channels.

Prompt, Deploy, Pray Is Dead: Validating AI Code with Proxymock

Recent outages tied to AI-assisted code changes have pushed companies into a corner. After several incidents with massive “blast radius” impacts, organizations like Amazon introduced stricter controls—mandating that senior engineers manually review all AI-generated code before it hits production. That response makes sense on paper, but it exposes a fatal flaw in the modern development pipeline.

MCP and A2A: What They Are and Why They Matter for Autonomous IT

MCP and A2A are the two protocols that make agentic AI governable at enterprise scale. One controls how agents use tools, and the other controls how agents work together. AI in the enterprise is no longer confined to chat windows. It’s operating inside incident queues and automation pipelines. Increasingly, teams are using AI agents to take action: detecting incidents, executing remediations, updating tickets, coordinating across systems.

From signals to savings: Optimizing cloud costs with Grafana Assistant and MCP servers

In today's cloud-native environments, managing resource waste and optimizing costs can feel like a constant battle. Operators, along with their fearless FinOps teams, spend countless hours hunting down unused resources, deciphering complex telemetry data, and manually implementing code or configuration changes to try to reduce cloud costs. But what if you could automate the entire process, from identifying waste to implementing the fix, all based on actual production telemetry?

EV Fleets Don't Fail on the Road. They Fail in the Workflow. Agentic AI Fixes That.

You spent the last decade obsessing over connectivity. You bought into the hype that ‘data is the new oil.’ You fitted your entire fleet with sensors and built massive dashboards to track everything from battery cell temperature to tire pressure. The mission was simple: Capture every metric. Congratulations, you succeeded. You are now drowning in terabytes of data. But here is the hard truth: Data without action is just expensive noise.

Test your AI model training reliability, too

Training is at the heart of every LLM model, but it’s still an application running on an infrastructure, which means it can fail. Our GPU test helps you test your training GPUs so you don’t lose that valuable work. TRANSCRIPT: One of the things we built recently was the GPU Gremlin. So if you are training a bunch of models and you're doing a bunch of GPU testing. You know, we want to give you the tools to be able to go test that, to understand how training the model could fail.