Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

AI SRE Summit 2026 Brings Together Engineering Leaders From AWS, Salesforce, Man Group, Smarsh, Honeycomb and More

Virtual event will explore what it takes to use AI in production SRE, from incident response and observability to platform design, cost control and self-healing operations TEL AVIV and SAN FRANCISCO, April 22, 2026 — Komodor, the autonomous AI SRE company, today announced it will host AI SRE Summit 2026, a free live virtual event on Tuesday, May 12, 2026, bringing together site reliability, platform engineering and cloud-native leaders to discuss how AI is changing production operations, and where i

Resolve's Agents of IT podcast - Ep. 17 - Agentic Workflows to Performance Intelligence

In this episode of Agents of IT, Ari Stowe sits down with Geoff McQueen, four-time founder and CEO of Ascendius, to unpack what it takes to navigate AI-driven disruption. Geoff shares a clear framework for where automation is headed, from individual AI use to agent-driven workflows to AI embedded across the business. Most organizations are still early. The real opportunity is in making AI work at the business level.

Nagios Plugins Collector: Run Your Existing Checks and Custom Scripts Inside Netdata

A lot of teams have a collection of Nagios plugins and custom monitoring scripts that have been running reliably for years. Some are standard community plugins for checking disk health or SSL certificate expiry. Others are homegrown Bash or Python scripts that check something very specific to the business: whether an API endpoint returns the right payload, whether a batch job completed on time, whether a queue depth is within bounds.

Under the Hood: Engineering JFrog Premium Availability

In the modern software factory, 99.9% uptime is no longer the gold standard. A standard 99.9% SLA translates to approximately 43 minutes of unexpected downtime per month. While industry data shows that a single minute of downtime costs an average of $9,000, for large global enterprises, that figure can easily be 5x higher. At tens of thousands of dollars per minute, those 43 minutes quickly compound into a catastrophic financial and operational risk.

What Is LLM Observability? For CFOs And Engineers, The Missing Layer Is Cost

You probably have Datadog. Maybe New Relic, maybe Dynatrace. Your observability stack has been solid for years — and you're still flying blind on AI cost. Here's why LLM observability needs a fourth pillar most tools skip, and how to build one that actually tells you what your models are costing you per request, per feature, per customer.

Blind Tokenmaxxing Is The New Cloud Waste. Focus on Outcome-Maxxing Instead

Meta's internal token leaderboard sparked a frenzy — and a reckoning. Tokenmaxxing without attribution is just cloud waste 2.0. Companies like Hudl and Duolingo use cost intelligence to connect every AI dollar to a business outcome.

Announcing Kosli's brand new docs

Good docs are how developers work with a product, from first look to daily use. That’s been true for a long time, and it’s becoming more true as developers increasingly hand that work to agents on their behalf. During the last quarter, we’ve been migrating docs.kosli.com from a static Hugo site to Mintlify, and now it’s finally live. Early reactions from our customers: “A marked improvement over the old docs in layout and usability.” “Looking sharp!”

An Introduction to Disaster Recovery Testing: What You Need to Know in 2026 | Harness Blog

Businesses today run on computers, cloud systems, and digital tools. One big failure can stop everything. A cyber attack, a power outage, or a software glitch can shut down operations for hours or days. Disaster recovery testing is how you prove you can restore critical services when the unexpected happens. 
 In 2026, with hybrid and multi-cloud estates, distributed data, and tighter oversight, this is not a once-a-year fire drill.

How to Install Terraform for Secure and Scalable Infrastructure Automation | Harness Blog

If your Terraform install is insecure or inconsistent, it can quickly slow down your delivery. A single compromised file or a misconfigured backend can stop deployments for many services. Teams that set up Terraform correctly from the start can scale easily and avoid compliance issues.

Beyond the Big Bang: De-risking Cloud Migrations with Progressive Delivery | Harness Blog

At 2 am, your migration goes live. By 2:07, error rates spike, and rollback isn’t an option. Cloud migrations, API rewrites, and architecture transformations rarely fail because of bad code. They fail because of how that code is released. Most teams still rely on a “big bang” cutover where infrastructure, services, and user-facing changes go live at once. This concentrates risk into a single moment.