Chaos Engineering works, but it has to scale

Over the years, Chaos Engineering has proven its effectiveness time and time again, uncovering risks and saving companies millions they would have lost in painful, brand-impacting outages. But as Chaos Engineering adoption increased, we found organizations running into the same stumbling blocks when they tried to scale. Individual teams would get great results with Chaos Engineering, then stall as they tried to get more teams involved.

3 things you can do to get closer to five nines

Five minutes a year. That’s how much downtime some of the world’s largest enterprises will tolerate. For most organizations, five nines (99.999%) of availability sounds like a pipe dream. But the trick to increasing availability isn’t massive infrastructure spending or complex system redesigns. All it takes is three key practices that any team can adopt. In this post, we’ll present these practices and show how we implement them at Gremlin.
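To see where the five-minute figure comes from, here is a minimal sketch of the arithmetic. It assumes a 365-day year and treats availability as the simple fraction of time a service is up. The function name `downtime_budget_minutes` is made up for illustration and is not part of any Gremlin tooling.

```python
# Illustrative arithmetic only: converts an availability target into a yearly
# downtime budget. Assumes a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a target availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Compare three, four, and five nines side by side.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which is where the "5 minutes" shorthand comes from. Each additional nine shrinks the budget by a factor of ten.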

{unscripted} AI in Chaos Engineering

Harness AI enhances your chaos engineering capabilities by using artificial intelligence to automate and optimize reliability testing and analysis. One challenge of scaling a Chaos Engineering practice across an organization is training users to create and run chaos experiments and to devise mitigations for the risks those experiments uncover. The Chaos Engineering module includes an AI agent, the "AI Reliability Agent", that helps with both.

AI-Powered Chaos Engineering with Harness MCP Server and Cursor

The Harness MCP Server integration with Cursor transforms chaos engineering from a complex, specialized discipline into an accessible, conversational workflow that any developer can leverage directly within their AI-powered IDE. By combining natural language prompts with comprehensive resilience testing tools, teams can discover, execute, and analyze chaos experiments without vendor-specific expertise, democratizing system reliability across DevOps, QA, and SRE functions.

Security vs. ops: the two sides of reliability

Security and ops work together to keep your systems reliable, so why do we treat them so differently? Reliability starts when you proactively take charge of your infrastructure and application risks. Transcript: When we talk about reliability in the software space and the digital operations space, you really end up falling into these two different mindsets.

Reliability means smooth on-call and a strong team

True reliability is when your engineers have confidence in their systems and their teams. Full transcript: Reliability to me means my on-call shift is gonna be smooth because everybody is making the attempts to be smart about the type of code that we're writing. And we're regularly testing to make sure that our system has redundancy and can withstand latency spikes, it can withstand resource spikes.