
Failover and cloud aren't enough for reliability

Amin Momin of @CapgeminiGlobal talks about how reliability takes dedicated effort beyond just using the cloud and setting up failover. Full transcript: There are two misconceptions about reliability. One is that people think failover alone is reliability: just doing the failover will be enough from the reliability point of view. That's the first one. And the second one: we are deployed into the cloud, so it is the service provider's responsibility to provide the reliability.

Fix issues faster with Recommended Remediations

You’ve successfully run a Fault Injection test and uncovered a new failure mode before it impacted customers. And the failure could have taken down your whole system if it had happened in production. Now what? Since this is a potential P1 outage, you absolutely need to address the issue, but that’s going to take some time as you dig through the service to track down the problem. Unfortunately, this is a common conflict.

True reliability takes the whole team

Reliability takes the whole team working together. Full transcript: If you really want to get good at measuring your reliability, then you have to work together as a team. Once your software engineering organization has decided, "We're gonna test these applications to make sure that they have redundancy, availability, resilience," just stick to that framework that you come up with as a team.

Reliability upholds your promise to users

Consistent systems are reliable systems, according to Ganesh Seetharaman, Managing Director at @Deloitte. Full transcript: Strong reliability is demonstrated when systems consistently work as expected, even during peak demand or unexpected events. When issues do happen, they are resolved quickly and transparently so users experience minimal disruption. Reliability also means data integrity. No matter how much stress the system is under, information needs to be accurate and secure.

How Experiment Analysis uncovers the cause behind failures

Chaos Engineering has proven itself to be incredibly effective at tracking down failure modes, remediating reliability issues, and preventing risks before they happen. Unfortunately, it can also come with a steep adoption curve. In order to get the most out of Fault Injection testing, a practitioner needs to have a deep knowledge of the service, its expected behavior, and the code behind it. Ultimately, the rewards are worth the time.

Reliability is when customers aren't impacted

Ultimately, a system is reliable when customers and engineers can count on it. Full transcript: When I get to hear stories like, "Hey, we just had our holiday sales event kick off and everything went smoothly and I didn't have to wake up in the middle of the night," that is really the true definition of reliability: these people that are constantly hands-on keyboard, in charge of making sure that people like myself and like you aren't impacted when we're going to, for example, buy a new pair of sneakers, or we're going to get some sort of limited edition release that's coming out, right?

Reliability Intelligence: your reliability expert

For the last decade, Gremlin has helped Fortune 500 organizations with critical uptime requirements proactively uncover reliability risks and prevent costly outages. We started with Chaos Engineering, then built Reliability Management to help teams standardize and scale their testing efforts. Today, we take another leap forward with the release of Reliability Intelligence. Reliability Intelligence draws on Gremlin expertise with each test to show you what happened and recommend remediation.