
Latest Posts

How the Gremlin agent fails safely

Testing shouldn’t feel risky. While it might sound counterintuitive, certain types of testing can actually increase risks to your systems. Load testing, for example, is a great way to see how your systems behave under pressure, but it can also cause those same systems to fail if they aren’t equipped to handle the load. For some types of testing, such as reliability testing and Chaos Engineering, that risk is necessary.

How to fix the root cause of a failed reliability test

You’re well on your way to becoming more reliable. You’ve added your services, found and fixed some Detected Risks, and run your first set of reliability tests. However, some of your tests returned as “Failed.” Not to worry: this isn’t a reflection of you or your engineering skills but rather an opportunity to learn more about how your systems work and, more importantly, how to make them more resilient.

Maximizing your reliability on AWS

Cloud providers like AWS excel at creating reliable platforms for developers to build on. But while the platforms may be rock-solid, this doesn’t guarantee your applications will be too. It’s the provider’s job to offer stable infrastructure, but you’re still on the hook for making your workloads resilient, recoverable, and fault-tolerant. There’s only one problem: cloud platforms are essentially black boxes.

What's the ROI of reliability?

Reliability doesn’t happen by itself. Making a system reliable and resilient enough that your customers can count on it takes a combination of time, effort, and resources that could be used elsewhere, such as shipping new features. It’s also not optional. In an era where downtime costs an average of $14,056/min (or $843,360/hr), outages have a material impact on businesses. Unfortunately, most systems are sprawling and complex enough that even small amounts of downtime can add up quickly.

Manage your reliability work more easily with Gremlin's newest features

Reliability testing is ongoing work, and tracking that work can be difficult in large organizations. Engineers run one-off experiments, scheduled Scenarios run in the background, and, for more mature teams, CI/CD workflows fire off automated tests on demand. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on in Gremlin—until now.

Gremlin's 2024 year-end Release Roundup

It’s been a busy year at Gremlin! We released two new experiments, added an entirely new onboarding process and features for AWS users, added a brand new Test Suite and Detected Risks, and made many UI improvements to our web app. We beefed up our agents with more enterprise capabilities, including support for large Kubernetes clusters and systems with over 64 CPUs, improved experiment behaviors, improved dependency detection, and per-team Private Network Integrations.

Release Roundup November 2024: Reliability in the serverless and AI era

2024 is coming to a close, and while many teams are slowing down in preparation for the holidays, we’ve been cooking up tons of new features. We’ve extended our platform support to the Istio service mesh, added a brand new experiment type for testing artificial intelligence (AI) and large language model (LLM) workloads, and made it easier to onboard Kubernetes clusters. We’ve also made our Linux and Windows agents more robust and performant.

Now in private beta: Gremlin Service Mesh Extension

Service meshes like Istio have become an essential way to securely and reliably distribute network traffic, especially with ephemeral, service-based architectures such as Kubernetes. However, their constantly shifting nature can interfere with targeting specific services for resilience tests. Infrastructure-based testing is designed to target specific IP addresses, allowing precision testing of applications, VMs, and nodes.

Reliable AI models, simulations, and more with Gremlin's GPU experiment

Note: This blog uses “GPU” to refer to the entire processing circuit, including the GPU processor, video memory, and other supporting hardware. Artificial Intelligence (AI) has become one of the biggest tech trends in years. From generating full movies to updating its own code, AI is performing tasks that were once science fiction.