Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Announcing Shared Scenarios to Promote a Culture of Reliability

Get started with Gremlin’s Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. Today, Gremlin is excited to announce the ability to share a Scenario across your entire organization. This allows you to build up a library of reliability exercises that are customized to your company’s applications and technology.

What your company can learn from the Bank of England's resilience proposal

Learn how to modernize your financial systems with confidence while mitigating risk (and meeting compliance). This article was originally published on TechCrunch. The outages at RBS, TSB, and Visa left millions of people unable to deposit their paychecks, pay their bills, acquire new loans, and more.

Announcing Chaos Conf 2020 (Online): Be Prepared For Moments That Matter

We’re excited to announce the third annual Chaos Conf! Given the events with Covid-19 this year, we will be holding this event fully online for the health and safety of attendees. The unforeseen impact of this virus on our lives, our businesses, and our software highlights the importance of preparing for the unexpected. Our theme for this year highlights this: Prepare for moments that matter. Chaos Conf will take place over the course of three days: October 6–8.

Gremlin User Newsletter: AWS App2Container, an update to the WAF, and what's new in Gremlin

As systems become increasingly complex, we’ve seen the growth of engineering tools to abstract away and manage the complexity. But often our tools are “opinionated” and the default actions or settings may not align with how our systems are intended to work or how we think they work. Chaos Engineering is a good way to not only test your applications, but also the tools you use to build them.

Ensuring reliability when modernizing financial applications

For decades, information technology in the financial services industry meant deploying bulky applications onto monolithic systems like mainframes. These systems have a proven track record of reliability, but don’t offer the flexibility and scalability of more modern architectures such as microservices and cloud computing. During periods of unexpectedly high demand, this inflexibility can cause technical issues for organizations ranging from personal trading platforms to major banks.

Testing the reliability of your fulfillment center

Fulfillment pipelines for order management in e-commerce have a lot of intricate moving parts that depend on one another. Sales orders, fulfillment, negotiation, shipment, and receipt are closely interconnected but require different actions while depending on one another closely. You also need messaging around order statuses, conditions, actions, rules, and inventory, just to name a few of the important parts of these complex systems.

What's the reliability of your checkout process?

One of the reasons companies practice Chaos Engineering is to prevent expensive outages in retail (or anywhere, for that matter) from happening in the first place. This blog post walks through a common retail outage where the checkout process fails, then covers how to use Chaos Engineering to prevent the outage from ever happening in the first place. Let’s dive in. Maybe you’ve been there.

Building more reliable financial systems with Chaos Engineering

The financial services industry has built in more capital buffers to prevent market shocks from bringing another economic collapse. In addition to these financial controls, many banks and personal trading platforms have begun building resiliency into information technology shocks. Despite these new precautions, we’re still seeing outages today, preventing customers from depositing and withdrawing their money, completing transactions, and executing trades during key events.

Performance tuning MongoDB with Chaos Engineering

You’ve pored over the MongoDB documentation, crafted highly polished and well-tuned queries, and confidently deployed your new code to production. Everything ran great at first, but once CPU or RAM usage hit a certain point, your queries suddenly slowed to a crawl. What happened, and how can you prepare for situations like this in the future? This is an unfortunate but common scenario with databases like MongoDB.

Announcing Status Checks to Ensure Safe Chaos Engineering Scenarios

One of the most important aspects of any Chaos Engineering program is knowing that every experiment is being run safely. And one of the simplest ways to ensure safe experiments is by having safeguards that prevent running chaos experiments on a system that is unhealthy or has an incident in progress. Today, Gremlin is excited to announce Status Checks, which run before you kick off a Chaos Engineering Scenario in order to verify your system is in a steady state.