
Cost Optimization Is Now Part of the SRE Playbook

In the era of cloud-native architectures, Site Reliability Engineering (SRE) has matured from a discipline focused purely on uptime to a sophisticated practice of efficient reliability. The key driver for this evolution is an undeniable truth: cloud spend has become intrinsically linked to system stability.

Welcome to the Next Frontier: AI on Kubernetes

Last week’s KubeCon Atlanta made one thing abundantly clear: Kubernetes is quickly becoming the de facto platform for AI workloads. The event lineup was chock full of talks, workshops, and even co-located events dedicated to AI, machine learning, and running data natively on Kubernetes – approximately 50 (!) sessions in total focused on AI, ML, LLM, and GenAI topics. What was until now mostly PoCs and aspiration is now truly delivering in production.

Lessons from KubeCon: What "Best-of-Breed" AI SRE Really Requires

This year’s KubeCon underscored a real shift: AI SRE has gone mainstream. Of course, it’s not a surprise. Teams from high-growth startups to Fortune 500s are running more complex, cloud-native systems, shipping more AI-generated code, and facing rising expectations. Downtime is absolutely not an option, and the workload for on-call SREs has become unsustainable. The question isn’t whether AI SRE helps. It’s which one you can trust in production.

Autonomous Self-Healing Capabilities for Cloud-Native Infrastructure and Operations

Modern cloud-native infrastructure was adopted to increase agility and scale, but as it grows in size and complexity, engineering teams are now drowning in operational noise. Industry research (The State of Observability for 2024) reveals that 88% of technology leaders report rising stack complexity, while 81% say manual troubleshooting actively detracts from innovation.

#050 - Data Protection and Kubernetes Resilience with Michael Cade & Julia Furst Morgado (Veeam)

In this episode, Itiel hosts Veeam experts Julia and Michael, who share their distinct paths into cloud-native technology. Julia discusses her transition from a background in law and marketing to becoming a CNCF Ambassador and AWS Container Hero. Michael, a veteran who has been with Veeam for over 10 years, details his traditional sysadmin background (virtualization, storage) and the evolution of that role into platform engineering.

[WEBINAR] From Blueprint to Production: A Live Workshop for Creating an MCP Server for Kubernetes

In this hands-on workshop, we cover how to build your own MCP server from scratch and connect it to AI tools like Cursor IDE or Claude Desktop. The first half is a live coding session you can follow along with to set up an MCP server for Kubernetes troubleshooting. In the second half, we take you behind the scenes at Komodor to show how we built our MCP Server MVP: a powerful bridge between AI assistants and Kubernetes infrastructure. This is just part of the 'magic' that helps the Klaudia agentic AI technology power Komodor's AI SRE Platform.
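To give a feel for what the workshop builds, here is a minimal, stdlib-only sketch of the pattern an MCP server follows: exposing named "tools" that an AI client discovers (`tools/list`) and invokes (`tools/call`) over JSON-RPC on stdio. This is an illustration, not Komodor's implementation; the `list_pods` tool and its kubectl invocation are hypothetical, and a real server would use the official MCP SDK rather than hand-rolled JSON-RPC.

```python
import json
import subprocess
import sys

# Hypothetical tool registry. MCP servers expose "tools" that AI clients
# (e.g. Cursor IDE or Claude Desktop) can discover and call.
TOOLS = {
    "list_pods": {
        "description": "List pods in a namespace via kubectl",
        "command": lambda ns: ["kubectl", "get", "pods", "-n", ns, "-o", "json"],
    },
}

def handle_request(req: dict) -> dict:
    """Dispatch one JSON-RPC request to the matching handler."""
    if req.get("method") == "tools/list":
        result = {"tools": [
            {"name": name, "description": spec["description"]}
            for name, spec in TOOLS.items()
        ]}
    elif req.get("method") == "tools/call":
        name = req["params"]["name"]
        ns = req["params"]["arguments"].get("namespace", "default")
        out = subprocess.run(TOOLS[name]["command"](ns),
                             capture_output=True, text=True)
        result = {"content": [{"type": "text", "text": out.stdout or out.stderr}]}
    else:
        return {"jsonrpc": "2.0", "id": req.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": req.get("id"), "result": result}

def main() -> None:
    # Real MCP servers speak JSON-RPC over stdio; here, one request per line.
    for line in sys.stdin:
        print(json.dumps(handle_request(json.loads(line))), flush=True)

if __name__ == "__main__":
    main()
```

Connecting a client is then a matter of registering the script's command in the tool's MCP configuration, after which the assistant can call `list_pods` against a live cluster.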

Kubernetes v1.34: What You Need to Know

Kubernetes v1.34, codenamed “Of Wind & Will (O’ WaW)”, brings a wide range of enhancements aimed at making clusters more efficient, secure, and easier to manage. This release delivers 58 enhancements with 23 graduating to Stable, 22 entering Beta, and 13 in Alpha, reflecting the platform’s continued maturation as enterprises scale their container orchestration needs.

#049 - The AI Translator: Using LLMs & MCP for K8s Operations & Self-Healing Infra with Alexei Le...

In this episode, Itiel Shwartz kicks off a series on MLOps, LLMs, and GenAI in Kubernetes, starting with Alexei Ledenev, who has over two decades in software development and deep experience in cloud architecture and distributed systems. He shares his journey from CoreOS Fleet to his current role on the Platform Team at Doit.

#048 - Shaping the Future of Software Development with Idan Gazit (GitHub Next)

Meet Idan Gazit from GitHub Next, a team responsible for projects like GitHub Copilot. Gazit, despite jokingly claiming to be "the least knowledgeable about Kubernetes," shares his diverse career journey, spanning from early web development with Perl and Django to his time at Heroku and eventually GitHub. He discusses his team's role in prototyping future software development solutions, emphasizing the importance of identifying and nurturing risky, impactful ideas for developers, even if it means "killing projects" that don't gain traction.