
Latest Videos

How to run Chaos Engineering experiments in your CI/CD pipeline

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Ad-hoc Chaos Engineering experiments are great for learning more about how your systems work, but they don’t tell you how your systems behave over time. As new features get deployed, environments change, and regressions get introduced, even the most resilient systems can gain reliability risks. QA and performance testing are already built into CI/CD, so why not reliability?

Confident Cloud Migrations: How a Top 5 Bank Ensured Reliability With AWS and Gremlin

In today's competitive landscape, migrating to the cloud brings substantial benefits, but the cloud’s new architectures and tools also bring new reliability risks and considerations. The challenge: Enterprises have to figure out how to capitalize on the benefits of the cloud while ensuring a seamless, reliable transition. This webinar offers a look at how to provide application reliability before, during, and after migrations with AWS and Gremlin.

Building Resilience in the Cloud With the AWS Well-Architected Framework and Gremlin

Reliability and resilience in the cloud require a different approach. Thankfully, the AWS Well-Architected Framework is a proven blueprint for cloud architects and engineering leaders seeking to design and operate resilient systems on AWS.

How to test your systems for scalability and redundancy with Fault Injection

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region? Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is with Fault Injection.

How to find Kubernetes reliability risks with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.

How to find and test critical dependencies with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz—what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

Kubernetes Reliability Risks: How to monitor for critical issues at scale

Learn how to automatically find and fix the most critical Kubernetes reliability risks in enterprise organizations. Recent research shows that nearly every organization has reliability risks in their Kubernetes clusters. Many of them are caused by simple misconfiguration, but they can have devastating consequences—including taking critical services offline. And while you could manually review every Kubernetes deployment, the speed and scale at which most organizations deploy to Kubernetes makes that impractical.

Building a Culture of Reliability: Why SREs Can't Do It Alone

Join Gremlin CTO and Founder Kolton Andrus to hear practical strategies for building a collaborative culture of reliability. High-velocity DevOps orgs and complex cloud-native architectures have made reliability harder than ever. Organizations are turning to SREs to make sure systems are reliable, but with so many stakeholders and competing priorities, many companies are still struggling to get ahead of the outages and incidents—SREs simply can't do it all by themselves.

What is Gremlin?

Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with DevOps and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience before it’s too late. To close the reliability gap, we need a reliability strategy: one that’s proactive, measurable, built-in, and automated. We need a reliability platform.

Enterprise Chaos Engineering Certification Prep Session

Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.