The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases the operational cost, and reduces the features a team can afford to offer.

From Blueprint to Production: Building a Kubernetes MCP Server

As Large Language Models (LLMs) evolve from simple chatbots into agentic workflows, the need for a standardized way to connect them to external data and infrastructure has become critical. In a recent workshop hosted by Nir Adler, Innovation Engineer at Komodor, we explored how to bridge this gap using the Model Context Protocol (MCP).

#052 - The "Short Long Path": Mastering Abstraction, Culture, and Kubernetes Scale with Shemer M...

In this episode, Itiel joins forces with Shemer, Director of Platform Solutions at the gaming giant Playtika, and Scott Rosenberg, Lead Architect at TeraSky, to discuss the realities of platform engineering at massive scale. The trio dissects Playtika’s multi-year journey from a legacy, homegrown Kubespray infrastructure to a modern, holistic platform built on Spectro Cloud, all while running strictly on-premises to support 25+ games and high-volume traffic.

Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective, it can be dangerous.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

Gartner predicts that AI agents will be implemented in 60% of all IT operations tools by 2028, up from fewer than 5% at the end of 2024. This acceleration has sparked an explosion of AI SRE solutions, from enterprise platforms to open-source alternatives, all promising faster root cause analysis and reduced MTTR.

How Cisco Revolutionized Platform Engineering with Komodor's Agentic AI

In the world of cloud-native infrastructure, complexity is the silent killer of innovation. For Cisco Outshift, the company’s incubation engine, managing a sprawling environment of AWS EKS clusters and edge-based MicroK8s workloads created a classic bottleneck: the Platform Engineering team was drowning in toil. Facing SRE burnout and the limits of human scaling, Cisco embarked on an ambitious journey to evolve its internal operations from standard DevOps to Agentic AI.

#051 - Surviving the Shift: From Legacy Monoliths to Day 2 Chaos with Hayato Shimizu (Digitalis)

From the early days of "neural nets" and WebSphere to the modern complexities of Kubernetes, Hayato Shimizu has seen the evolution of infrastructure firsthand. In this episode of Kubernetes for Humans, the co-founder of Digitalis joins the show to discuss the harsh realities of enterprise platform engineering and his personal journey from corporate employee to consultancy owner.

AI SRE in Practice: Resolving Node Termination Events at Scale

When a node terminates unexpectedly in a Kubernetes cluster, the immediate symptoms are obvious. Workloads restart elsewhere, services experience partial outages, and alerts fire across multiple systems. The harder question is why it happened and how to prevent it from recurring. This scenario walks through a node termination event where the entire node pool was affected, requiring investigation across infrastructure layers to identify root cause and implement lasting remediation.

[Webinar] Building Quality-Driven Agentic AI in Noisy Big Data Environments

Watch Itiel Shwartz, Komodor CTO and Co-Founder, share hard-won lessons from developing an AI SRE agent that processes millions of K8s events daily to deliver autonomous troubleshooting that reached 95%+ accuracy in benchmarking. This webinar covers building production-ready systems that maintain reliability when 90% of your data is noise.

AI SRE in Practice: Diagnosing Configuration Drift in Deployment Failures

Deployments fail for dozens of reasons. Most of them are obvious from the error messages or pod events. But when a deployment rolls out successfully according to Kubernetes, yet your application starts experiencing latency spikes and rising error rates, the investigation becomes significantly harder. This scenario walks through a configuration drift incident where the deployment appeared healthy but available replicas were constantly flapping, creating cascading reliability issues.