GPU cloud for AI inference in production: How infrastructure requirements change after training

Training a model is a project with an end date. Inference is what happens for the rest of the model's working life. The two workloads share GPUs, frameworks, and a lot of vocabulary, but the infrastructure decisions that make sense during training are usually the wrong ones in production. Teams that treat inference as "training, but smaller" tend to discover the gap somewhere around their first traffic spike.

MCP Servers Are Becoming a Core Interface Layer in Data Observability and Data Quality

Data observability has traditionally been built around human workflows. When data breaks, engineers are alerted, open dashboards, inspect lineage graphs, and manually trace the issue across pipelines. The system is designed for human investigation and interpretation. That model is now being challenged by the rise of AI agents in data operations. As organizations begin embedding AI into analytics, engineering, and decision-making workflows, observability is no longer just about explaining what happened - it must also enable systems to understand and act on it.

Bridging AI and Infrastructure: Introducing the Megaport MCP Server for Agentic Networking

Discover the Megaport MCP Server and how it enables AI-powered, agentic networking through natural language access to network infrastructure. By Miwa Fujii, Community Manager - Terraform, and Ryan Tucker, Solutions Architect. In the cloud networking era, we’ve moved from manual configurations in the Portal to Infrastructure as Code (IaC) with Terraform. But the next frontier isn’t just code, it’s intelligence. We are pleased to announce the release of the Megaport MCP Server (Open Beta).

Beyond tokens per watt - using Ubuntu 26.04 LTS for AI

Tokens per watt (TpW), the measure of useful AI work produced per watt of energy consumed, is top of mind for CEOs, heads of AI, and infrastructure teams alike. With the tremendous cost of GPU clusters, extracting as much value as possible from the expense is critical. But in the pursuit of tokens, it’s important to remember that hardware efficiency isn’t the only factor influencing data center operating costs, or the output of useful, revenue-generating AI work.

AI Agent Governance: The Missing Piece of Autonomous IT

AI agents are making decisions, accessing systems, and resolving issues autonomously. But as organizations deploy more agents, one challenge becomes impossible to ignore: governance. Who has access? What changed? Who is accountable? The future of Autonomous IT requires autonomy with accountability.

A package manager for AI assets (and why the lock file is per-user)

Sometime in the last two years your repos quietly filled up with a new category of file. Not code, not config exactly: prompts. A .claude/skills/ directory here. A .cursor/rules/ folder there. A CLAUDE.md at the root, an AGENTS.md next to it, a .mcp.json listing the servers your agent is allowed to call. These are the things that make a coding agent useful on your codebase, and they're sprawling.

Asimov's Zeroth Law of Robotics: testing and observing AI (ExpoQA 2026)

Asimov's Three Laws of Robotics are missing one — and when it comes to testing and observing AI, Nicole van der Hoeven argues that missing rule changes everything: before a robot can avoid harm, obey orders, or protect itself, there has to be a Zeroth Law: a robot must be observable. Because if you can't see what a system is doing, you have no way of knowing whether it's following any rule at all.

Why Engineers Don't Trust Autonomous AI - 4th Annual Observability Survey | Grafana Labs

The 2026 Observability Survey from Grafana Labs heard from over 1,300 engineers and leaders across 76 countries on the real-world role of AI in observability. The data reveals a sharp distinction between intelligence and autonomy — and a critical blind spot most teams have.