Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Who's on call? How Claude helped us calculate this 2,500x faster

Schedules are a core part of any on-call system. In ours, they define who to page and when. But people use them in lots of other ways too: checking their next shift, asking for cover while at the gym, keeping a Slack user group up to date, or updating a Linear triage responsibility. For many of our customers, they’re one of the main ways they interact with our product, and as they’re such a foundational part of On-call, it’s very important they work well.

Stop watching the looms: why the AI era belongs to infrastructure

I live in Manchester, England now. I moved here from Texas last summer (which is its own story), but the thing I wasn't prepared for is how the Industrial Revolution isn't history here. It's the city itself. And if you're American like me, you might need to hear this: the Industrial Revolution didn't start in the US. It started here. Manchester is where the modern world was born. You see it everywhere. The old cotton mills converted into apartments.

What Every IT Operations Team Should Know About Managing IPv4 in 2026

IPv4 was supposed to be a temporary problem. Address exhaustion was meant to push the entire internet toward IPv6 within a decade, and operations teams could simply manage the transition and move on. That hasn't happened. Most enterprise networks still run dual-stack configurations, customer-facing services still depend heavily on IPv4, and the secondary market for addresses has become a permanent fixture of modern infrastructure planning.

Argo Rollouts Canary Monitoring: Metrics, Gotchas, and Automated Gates with Last9

Argo Rollouts exposes Prometheus metrics on port 8090 — but the docs lie about which labels exist. Here's how to scrape them into Last9, build a canary dashboard, and use Last9 as an automated AnalysisTemplate gate, including the auth and base64 gotchas. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

All You Need to Know About CrashLoopBackOff Error

Kubernetes is an open-source container orchestration engine that is used to automate containerized application deployment, scaling, and administration. It is an open-source management platform that can be used to manage containerized workloads and services, as well as declarative configuration and automation. Kubernetes is a framework for running distributed systems in a resilient manner. It handles scaling and failover for your application and provides deployment patterns and other features.

Eliminate Manual Authentication Configuration for Fast & Effective API Security Scanning | Harness Blog

Application security testing tools promise coverage and accuracy, but teams often struggle just to get started. One of the biggest friction points in dynamic application security testing is configuring authentication correctly so a scanner can even access a target application, let alone API endpoints that power the functionality. Whether it’s API keys, bearer tokens, or custom auth flows, setting up authentication for scans frequently requires trial-and-error and engineering support.

Understanding disaggregated GenAI model serving with llm-d

llm-d is an open source solution for managing high-scale, high-performance Large Language Model (LLM) deployments. LLMs are at the heart of generative AI – so when you chat with ChatGPT or Gemini, you’re talking to an LLM. Simple LLM deployments – where an LLM is deployed to a single server – can suffer from latency issues, even with just one user. This can be because of lack of memory-bandwidth on the server, or because of KV cache pressure on system memory.