
The riskiest thing you can do is not measure your risk

Hiring good engineers is important, but it’s not enough to prevent outages. You need to measure and track your risk to get real results. Full transcript: My name's Jeff Nickoloff. I'm a principal engineer here at Gremlin. What I hear non-technical functions talk about is really they are much happier to sort of lean on their great engineers. "Oh, we've got a great engineering culture. We don't have reliability issues because we hire the best people."

Avoid the Chaos Engineering bottleneck

Chaos Engineering is great, but by itself it can create bottlenecks that limit your reliability journey. Full transcript: One of the things we've learned while building Gremlin and being the first Chaos Engineering tool to market is with all the greatness that comes with this approach, we've learned some of the downfalls, some of the drawbacks. And one of those is how you scale this practice.

Beyond AI hype: put reliability at the forefront

Reliability is a constant for every technology, whether it’s cloud, microservices, or AI. Full transcript:  Just a few years ago everybody was screaming about microservices, "That's the wave of the future," and now everybody's looking at AI. No matter what the change in technology hot topic is, your reliability should still be at the forefront of everything that you're doing.

Reliability is not about mythical perfection

See what reliability means to Ganesh Seetharaman, Managing Director at Deloitte, and why it's more than high uptime. Full transcript:  Reliability to me is not about achieving mythical perfection. It's about embracing complexity, recovering quickly from failures or incidents, and building trust through transparency and adaptability.

What to expect in a Gremlin workshop

Gremlin workshops give your team hands-on training with Gremlin so they can get real results and dramatically improve your reliability. Full transcript:  The goal of our workshops is really to accelerate you and the team in your reliability journey. Whether you're starting out for the first time, or you're a more advanced user, this workshop is really designed for you to take you to the next level.

Lessons from Alaska's outage: Redundant ≠ resilient

Last Sunday, Alaska Airlines suffered a three-hour outage that led to more than 200 flight cancellations and disrupted 15,600 passengers. The culprit? “A critical piece of multi-redundant hardware at our data centers, manufactured by a third-party, experienced an unexpected failure. When that happened, it impacted several of our key systems that enable us to run various operations, necessitating the implementation of a ground stop to keep aircraft in position.”

Measure your reliability risk, not your engineers

Do you know the current reliability risk of your systems? Do you know right now how your services will react to common failures like a dependency going down? Sadly, most organizations don’t have answers to these questions, relying on QA tests and the skill of their engineers to deploy code they assume won’t break. But this is a process problem, which means you can’t hire your way out of it.

Reliability is about more than uptime

Reliability results are about more than whether your application is up; they're about proactive measurement and keeping it up. Full transcript: Reliability results in my earlier career was, "Is there any downtime? Are there any errors that are getting thrown?" It's not a proactive way to measure your reliability. If you're measuring it in time of production, it's not gonna be an accurate reflection of what your reliability is. The way that my mindset has changed over time has been a proactive measurement. Before we ship something out, is this gonna be reliable from the start?

How to ensure your AWS workloads are resilient

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Cloud providers like AWS give you plenty of tools to make your workloads more resilient, but it’s up to you to apply them. However, considering how complex some of these tools are, where do you start? And how can you be sure your systems are more reliable as a result?