
Why agentic AI development needs reliability guardrails

AI has massively accelerated code deployment. Since the introduction of agentic coding, GitHub has seen exponential growth in PRs, commits, and new repos. Where it originally predicted needing 10X capacity, it now estimates needing 30X, and the biggest driver is agentic development. Companies across industries are building agentic pipelines to ship features faster than ever before. That acceleration isn’t without risk.

Learn these 4 Chaos Engineering Principles Before You Break Anything | Resilience Testing | Harness

Want to start chaos engineering? Don't randomly break stuff and hope for the best. Real chaos engineering starts with defining your system's steady state metrics like latency, throughput, and error rates. Then you form a clear hypothesis about what should happen when failures occur. Next, you inject controlled failures, starting small with single pod kills or network drops, not production meltdowns. Finally, you limit the blast radius by running experiments in safe environments first.
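The four steps above can be sketched as a small, self-contained simulation. This is only an illustration under stated assumptions: the service is a toy model in which latency rises as replicas are removed, and every name here (SteadyState, simulate_traffic, run_experiment, the thresholds) is hypothetical, not part of any Harness or other vendor API.

```python
import random
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds that define 'healthy' for this hypothetical service."""
    max_p95_latency_ms: float
    max_error_rate: float


def p95(samples):
    """Nearest-rank 95th percentile; infinite if no request succeeded."""
    if not samples:
        return float("inf")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def simulate_traffic(replicas_up, replicas_total, requests=1000, seed=0):
    """Toy model (an assumption): fewer healthy replicas means higher latency."""
    rng = random.Random(seed)
    latencies, errors = [], 0
    for _ in range(requests):
        if replicas_up == 0:
            errors += 1
            continue
        load_factor = replicas_total / replicas_up
        latencies.append(rng.uniform(20, 60) * load_factor)
    return latencies, errors / requests


def within_steady_state(latencies, error_rate, steady):
    return (p95(latencies) <= steady.max_p95_latency_ms
            and error_rate <= steady.max_error_rate)


def run_experiment(replicas_total, pods_to_kill, steady, max_blast_radius=0.5):
    """Return True if the hypothesis 'the service stays in steady state' holds."""
    # Step 4: limit the blast radius before touching anything.
    if pods_to_kill / replicas_total > max_blast_radius:
        raise ValueError("experiment exceeds the allowed blast radius")
    # Step 1: confirm the steady state before injecting a failure.
    baseline = simulate_traffic(replicas_total, replicas_total)
    if not within_steady_state(*baseline, steady):
        raise RuntimeError("system is not in steady state; do not inject failures")
    # Steps 2 and 3: inject the controlled failure and check the hypothesis.
    degraded = simulate_traffic(replicas_total - pods_to_kill, replicas_total)
    return within_steady_state(*degraded, steady)


steady = SteadyState(max_p95_latency_ms=100, max_error_rate=0.01)
survives_one_pod_kill = run_experiment(3, 1, steady)
survives_two_pod_kills = run_experiment(3, 2, steady, max_blast_radius=1.0)
```

In this toy model, killing one of three pods keeps p95 latency under the threshold, while killing two does not. The default blast-radius limit rejects the larger experiment outright unless it is raised on purpose.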

Chaos Engineering vs. Traditional Testing: What's the Difference? | Resilience Testing | Harness

Stop treating system outages like surprises and start preparing for them. While traditional software testing is the bedrock of development, using unit, integration, and regression tests to verify that code meets specific requirements, it only accounts for what we expect to happen. Chaos Engineering takes a different approach by shifting the focus from bug prevention to system resilience. Instead of asking "does this work?", Chaos Engineering asks "how does this survive?" by injecting real-world turbulence like network latency or pod failures directly into production-like environments.
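The contrast can be illustrated with a short sketch. The first check below is a traditional test of expected behavior; the second injects latency into a dependency and asks whether the caller survives. All names (healthy_fetch, inject_latency, get_recommendations) and the 200 ms timeout are hypothetical, and the latency is simulated rather than real.

```python
def healthy_fetch(timeout_ms):
    """Stand-in for a downstream recommendation service (hypothetical)."""
    return ["item-1", "item-2"]


def inject_latency(fetch, added_ms):
    """Wrap a dependency so it behaves as if added_ms slower (no real sleep)."""
    def wrapped(timeout_ms):
        if added_ms > timeout_ms:
            raise TimeoutError(f"dependency exceeded {timeout_ms} ms")
        return fetch(timeout_ms)
    return wrapped


def get_recommendations(fetch, timeout_ms=200):
    """Return personalized items, or a static fallback if the dependency fails."""
    try:
        return fetch(timeout_ms)
    except (TimeoutError, ConnectionError):
        return ["bestsellers"]  # degraded, but still serving


# Traditional test: does this work under expected conditions?
expected_path = get_recommendations(healthy_fetch)

# Chaos-style test: how does this survive a slow dependency?
degraded_path = get_recommendations(inject_latency(healthy_fetch, added_ms=500))
```

The second call only passes because the caller has a fallback. A traditional test suite would not reveal whether that fallback exists.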

What is Chaos Engineering? Explained in 60 seconds | Resilience Testing | Harness

Discover how leading engineering teams proactively build rock-solid applications using Chaos Engineering. Learn why waiting for real outages is risky and how intentionally injecting controlled failures like pod crashes, network latency, and node restarts helps uncover hidden weaknesses before they impact your users. In this short, explore the simple yet powerful practice that turns fragile systems into resilient ones and how Harness makes running chaos experiments effortless and safe with its intuitive Resilience Testing module.

3 Biggest Myths of Chaos Engineering

Are myths about chaos engineering preventing your team from building more resilient systems? In this video, Matt Schillerstrom, Director of Product Management at Harness and founding engineer of the chaos engineering program at Target.com, breaks down the three most common misconceptions about chaos engineering. Drawing from his experience building large-scale programs, Matt explains how to move past these myths to build confidence in your infrastructure.

The hidden reliability risks in your agentic AI workflows

Artificial intelligence recently took a major leap from “saying” to “doing.” Instead of simple back-and-forth chats, we’re now allowing automated AI processes to take action on our behalf—from responding to emails to building and deploying complete applications. This shift from “assistant” to “actor” can make applications more capable, but it also creates additional failure modes.

Test your AI model training reliability, too

Training is at the heart of every LLM, but it’s still an application running on infrastructure, which means it can fail. Our GPU test helps you test your training GPUs so you don’t lose that valuable work. TRANSCRIPT: One of the things we built recently was the GPU Gremlin. So if you are training a bunch of models and you're doing a bunch of GPU testing, you know, we want to give you the tools to be able to go test that, to understand how training the model could fail.

How Gremlin makes disaster recovery testing easier and faster

There’s a common saying: “A backup isn’t a backup until you’ve tested it.” The same is true whether it’s a simple database failover or an entire data center/cloud provider failover. You simply won’t know if it works if you don’t test it. When it comes to disaster recovery testing, that can be an expensive, painful, and arduous process. But it’s required by companies for a reason. And not just for disasters like hurricanes, flooding, or earthquakes.

From Chaos Engineering to Resilience Testing: Why We're Expanding How Teams Validate Reliability | Harness Blog

At Harness, we’re committed to helping teams build and deliver software that doesn’t just work – it thrives under pressure, scales reliably, and recovers swiftly from the unexpected. Today, we’re taking the next step in that mission by evolving our Chaos Engineering module into Resilience Testing. This evolution reflects how reliability is tested in practice today.