Latest Posts

Maximizing your reliability on AWS

Jan 13, 2025 By Andre Newman In Gremlin

Cloud providers like AWS excel at creating reliable platforms for developers to build on. But while the platforms may be rock-solid, this doesn’t guarantee your applications will be too. It’s the provider’s job to offer stable infrastructure, but you’re still on the hook for making your workloads resilient, recoverable, and fault-tolerant. There’s only one problem: cloud platforms are essentially black boxes.


Manage your reliability work more easily with Gremlin's newest features

Jan 6, 2025 By Andre Newman In Gremlin

Reliability testing is ongoing work, and tracking that work can be difficult in large organizations. Engineers run one-off experiments, scheduled Scenarios run in the background, and, for more mature teams, CI/CD workflows fire off automated tests on demand. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on in Gremlin—until now.


Gremlin's 2024 year-end Release Roundup

Dec 18, 2024 By Andre Newman In Gremlin

It’s been a busy year at Gremlin! We released two new experiments, added an entirely new onboarding process and features for AWS users, added a brand new Test Suite and Detected Risks, and made many UI improvements to our web app. We beefed up our agents with more enterprise capabilities, including support for large Kubernetes clusters and systems with over 64 CPUs, improved experiment behaviors, improved dependency detection, and per-team Private Network Integrations.


Release Roundup November 2024: Reliability in the serverless and AI era

Dec 4, 2024 By Andre Newman In Gremlin

2024 is coming to a close, and while many teams are slowing down in preparation for the holidays, we’ve been cooking up tons of new features. We’ve extended our platform support to the Istio service mesh, added a brand new experiment type for testing artificial intelligence (AI) and large language model (LLM) workloads, and made it easier to onboard Kubernetes clusters. We’ve also made our Linux and Windows agents more robust and performant.


Now in private beta: Gremlin Service Mesh Extension

Dec 4, 2024 By Gavin Cahill In Gremlin

Service meshes like Istio have become an essential way to securely and reliably distribute network traffic, especially with ephemeral, service-based architectures such as Kubernetes. However, their constantly shifting nature can interfere with targeting specific services for resilience tests. Infrastructure-based testing is designed to target specific IP addresses, allowing precision testing of applications, VMs, and nodes.


Reliable AI models, simulations, and more with Gremlin's GPU experiment

Dec 2, 2024 By Andre Newman In Gremlin

Note: This blog uses “GPU” to refer to the entire processing circuit, including the GPU processor, video memory, and other supporting hardware.

Artificial Intelligence (AI) has become one of the biggest tech trends in years. From generating full movies to updating its own code, AI is performing tasks that were once science fiction.


How reliability engineering can verify disaster recovery plans

Nov 5, 2024 By Gavin Cahill In Gremlin

Disaster recovery plans have always been a crucial part of businesses—especially essential services like banks. These plans help keep your business up and running during a disaster or extreme scenario so you can be there for your customers when they need you the most.


Three serverless reliability risks you can solve today using Failure Flags

Oct 16, 2024 By Andre Newman In Gremlin

Serverless platforms make it incredibly easy to deploy applications. You can take raw code, push it up to a service like AWS Lambda, and have a running application in just a few seconds. The serverless platform provider assumes responsibility for hosting and operating the platform, freeing you up to focus on your application. Naturally, this raises a question: if something goes wrong, who’s responsible?


Best Practices for Testing Zone Redundancy

Oct 16, 2024 By Sam Rossoff In Gremlin

As the story goes, in the old days Amazon used to cut power to data centers to see whether its services were actually redundant across different data centers, and it abandoned the practice only when EC2 customers started to complain (no matter how many times they were warned that their instances might disappear without notice). This story may be apocryphal, but power loss is not the only way a given data center can go down.


Interpreting your reliability test results

Sep 19, 2024 By Andre Newman In Gremlin

Gremlin’s default suite of reliability tests analyzes critical functions of modern services: scalability, redundancy, and resilience to dependency failures. Services that pass this suite of tests can be trusted to remain available during unexpected incidents. But what happens when a service fails a test? How do you take failed test results and turn them into actionable insights? This blog aims to answer that question.

