The latest News and Information on Containers, Kubernetes, Docker and related technologies.

Platform engineering unplugged: What nobody tells you about platform engineering at scale

Most platform engineering stories are told in hindsight, with the rough edges smoothed out. On June 17th, we are doing it differently. Join us for Platform Engineering Unplugged, a frank conversation with a practitioner who has navigated the real challenges of building and scaling platform engineering. What worked, what didn't, and what they would do differently. If you lead engineering teams and are thinking seriously about platform engineering, this is the session for you.

How to build a secure AI agent sandbox with relaxAI and Claude Code

AI agents are powerful. They're also unpredictable, non-deterministic, and capable of doing things you didn't ask them to do, as the Rome Alibaba and Claude Mythos case studies make very clear. The answer isn't to avoid agentic AI. It's to run it properly. In this demo, Ben Norris, founding engineer at relaxAI, shows how to build a fully sandboxed AI agent environment from scratch: an ephemeral Civo VM provisioned via Terraform and GitHub Actions, locked down with egress policies, an unprivileged Linux user, and hard resource caps, running a Claude Code session pointed at the relaxAI API.

Lock-in is not theoretical: What UK organizations told us about cloud exit barriers

For years, vendor lock-in has been discussed as a theoretical risk. A concern to acknowledge in architecture reviews. A box to tick in compliance frameworks. A future problem that might need addressing. Our latest research reveals something more urgent. For UK organizations, lock-in isn't theoretical anymore. It's structural. It's measurable. And it's preventing organizations from acting on their own strategic priorities.

Why We Built Lynx: Bringing Control to the Age of AI Agents

For a decade, one idea has guided everything we’ve built at Tigera: How do you secure a dynamic system with a lot of moving parts that is changing rapidly, with a programmatic approach? Calico has applied that idea for Global 2000 companies running the largest Kubernetes platforms in the world, securing tens of millions of mission-critical transactions every day. Today I’m excited to announce the next chapter of that work: Lynx, a unified control plane for Kubernetes-native AI agents.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.

Why developer teams are rethinking their cloud provider this year

The default cloud choice for technically literate teams has shifted. It hasn't shifted dramatically: the major hyperscalers aren't going anywhere, and their enterprise position is still strong. But the conversation that used to start with "which hyperscaler" now genuinely starts with "what do we actually need." That's new.

How to monitor and optimize GPU utilization in the cloud

GPU utilization is one of the most expensive metrics in cloud infrastructure to get wrong. A GPU running at 30% utilization costs the same as one running at 90%, but it's doing a third of the useful work. For workloads measured in tens of thousands of GPU-hours, the difference between average utilization in the 30s and average utilization in the 70s is hundreds of thousands of dollars across the life of the workload.
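
The arithmetic behind that claim can be sketched in a few lines of Python. This is only an illustration: the hourly rate, the workload size, and the function names `billed_cost` and `relative_useful_work` are assumptions chosen for the example, not figures or code from the article. It treats utilization as the fraction of billed GPU time that does useful work.

```python
def billed_cost(useful_gpu_hours: float, hourly_rate: float, utilization: float) -> float:
    """Cost of renting enough GPU time to obtain `useful_gpu_hours` of real work.

    `utilization` is the fraction (0, 1] of billed time spent on useful work.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    billed_hours = useful_gpu_hours / utilization
    return billed_hours * hourly_rate


def relative_useful_work(utilization_a: float, utilization_b: float) -> float:
    """Useful work done at utilization_a relative to utilization_b, for the same bill."""
    return utilization_a / utilization_b


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    useful_hours = 30_000   # useful GPU-hours the workload needs
    rate = 3.00             # assumed price in dollars per GPU-hour

    low = billed_cost(useful_hours, rate, 0.30)
    high = billed_cost(useful_hours, rate, 0.75)

    print(f"Work at 30% vs 90% utilization: {relative_useful_work(0.30, 0.90):.2f}x")
    print(f"Cost at 30% utilization: ${low:,.0f}")
    print(f"Cost at 75% utilization: ${high:,.0f}")
    print(f"Difference:              ${low - high:,.0f}")
```

Under these assumed numbers the gap between the two utilization levels lands in the six-figure range, which is the order of magnitude the paragraph describes. Real prices and workload sizes will move the result considerably.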