What "AI-Ready Data" actually means for observability teams

Many organizations deploying AI are learning the same lesson right now: the challenge isn’t any particular AI model; it’s the data. Gartner predicts that organizations will abandon 60% of AI projects that are not supported by AI-ready data. It also reports that 63% of organizations either lack, or aren’t sure they have, the right data management practices for AI.

Why Your Agentic AI Aspirations Need to Evolve from Models to a Workflow Data Fabric

Enterprise conversations today are dominated by one phrase: Agentic AI. Across boardrooms and innovation labs, organizations are experimenting with copilots, autonomous agents, and AI bots capable of resolving tickets, recommending actions, and orchestrating complex processes. The promise is real: AI that doesn't just generate insights but takes meaningful action. Here's the uncomfortable truth: most enterprises are architecturally unprepared for the agentic future they're trying to build.

Understanding disaggregated GenAI model serving with llm-d

llm-d is an open source solution for managing high-scale, high-performance Large Language Model (LLM) deployments. LLMs are at the heart of generative AI, so when you chat with ChatGPT or Gemini, you’re talking to an LLM. Simple LLM deployments, where an LLM is deployed to a single server, can suffer from latency issues even with just one user. This can be caused by limited memory bandwidth on the server or by key-value (KV) cache pressure on system memory.

SRE agent vs. traditional engineer: 7 key differences

The role of a Site Reliability Engineer (SRE) is evolving. The focus has shifted away from simply working harder during an outage, and a new kind of teammate is here to help: the SRE agent. But what are the key differences between an SRE agent and a traditional site reliability engineer? This isn’t just a superficial change. It signals a fundamental shift in how teams build and maintain reliable services.

Live Runtime Investigation in Claude Code with Lightrun MCP

In this video, Lightrun’s Dan Putman demonstrates what happens when Lightrun MCP is integrated into Claude Code. See how, once it is activated, Claude can ask specific questions about which services it can see and instrument, then run a deep investigation in production and reach a validated root cause analysis without the friction of redeploying or switching contexts.

Debug Live Production Apps in Codex with Lightrun MCP

Lightrun’s Dan Putman demonstrates the power of the latest Lightrun MCP skill. Watch how your AI coding agent can debug live applications directly in production. By connecting OpenAI's Codex to real-time runtime data via the Lightrun MCP, engineers can generate and validate hypotheses using live telemetry and snapshots, without breaking flow. Ready to bring runtime context to your AI agents?