
Reliability isn't a metric, it's a mindset

For Nick Mason, Sr. Solutions Architect at Gremlin, who lives with Type 1 diabetes, reliability is a way of life. Full transcript: Reliability isn't just a metric; to me, it's a mindset. As someone that works in site reliability engineering and also someone who lives with type one diabetes, the concept of reliability is deeply personal to me. In tech, reliability means building systems that are going to recover gracefully, and in life with a chronic condition like diabetes, it's the same thing.

Reliability means being there right when your customer needs you

When your systems are reliable, it means your customers can count on your applications to be there for them. Full transcript: To me, reliability means a good night's sleep, and being able to confidently go to bed and wake up the next day feeling ready to get out there and do my best work and not worry about the experience that our customers might have had through the night.

4 Chaos Engineering recommendations from Gartner

Gartner recently published their annual Hype Cycle reports, including the Hype Cycle for Infrastructure Platforms. Designed to help heads of infrastructure and IT operations make informed decisions about infrastructure platforms, it includes over thirty different topics covering everything from platform engineering to distributed cloud to policy as code—including Chaos Engineering and Site Reliability Engineering.

Why we're talking to people about reliability

Reliability means a lot of things to a lot of people, but it’s also essential for every digital business. That’s why we’re talking to reliability experts from all over to find out what reliability means to them and how you can improve it. Transcript: You know, we're all out here building and operating digital businesses and, like, nobody's talking about reliability enough. We gotta talk about it. I can't stop talking about it and I've been on call for, like, 20 years.

Insights to keep AI applications reliable

AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether it’s through homegrown, self-managed models or third-party model integrations. But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity and complications that require a shift in your approach.

How to test your systems for scalability and redundancy with fault injection

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region? Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is fault injection.

How to be prepared for cloud provider outages

GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour, 28-minute outage affected dozens of companies and spanned 80+ Google services and products. But what was really illuminating was just how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because they depended, directly or through their vendors, on services that did use GCP.

How to set up chaos engineering in your CI/CD pipeline with CircleCI and Chaos Toolkit

Distributed architectures are increasingly common in modern software systems because they bring scalability and flexibility that help those systems stay resilient under real-world conditions. Unfortunately, distribution also introduces new points of failure. Traditional testing methods are no longer enough; they focus only on whether a system works, not on whether it keeps working under stress or failure. That is where chaos engineering comes in.

How to test Istio and other service meshes

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Service meshes bring applications together, but not always reliably. Even the most well-configured Istio deployment can have unexpected reliability risks that aren’t apparent until you’re already in production. Latency, single points of failure, poorly defined APIs—these problems can grow beyond a single service and impact the user experience for your entire application.