
How to find Kubernetes reliability risks with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.
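To give a feel for the checks involved, here is a minimal sketch of two of these risk classes, written against the official `kubernetes` Python client. Gremlin's Detected Risks does this (and more) automatically; this is only an illustration of what the manual version of such a scan looks like.

```python
# Minimal sketch: manually scan for two of the risk classes that
# Gremlin's Detected Risks surfaces automatically. Assumes the official
# `kubernetes` Python client (pip install kubernetes) and cluster access.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces().items:
    name = f"{dep.metadata.namespace}/{dep.metadata.name}"

    # Risk: single point of failure, only one replica requested.
    if (dep.spec.replicas or 1) < 2:
        print(f"{name}: single replica (no redundancy)")

    for c in dep.spec.template.spec.containers:
        # Risk: missing resource definitions, no CPU/memory requests.
        if c.resources is None or not c.resources.requests:
            print(f"{name}/{c.name}: no resource requests defined")
```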

Three key facts about serverless reliability

Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability. In this blog, we’ll look at three important facts about serverless reliability that teams often overlook.
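As one concrete illustration of a detail teams often overlook: the platform manages the servers, but your function code still owns timeouts and retries on downstream calls. The sketch below is illustrative only; the endpoint URL is made up, and the retry settings are examples rather than recommendations.

```python
# Sketch: even on serverless, your code owns reliability details like
# timeouts and retries on downstream calls. Uses `requests`; the URL
# below is a hypothetical placeholder.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures, but cap attempts so the function fails fast
# instead of hanging until the platform kills it.
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=2, backoff_factor=0.5, status_forcelist=[502, 503, 504])))

def handler(event, context):
    try:
        resp = session.get("https://inventory.example.com/items",
                           timeout=3)  # well under the function timeout
        resp.raise_for_status()
        return {"statusCode": 200, "body": resp.text}
    except requests.RequestException:
        # Degrade gracefully instead of surfacing a raw error.
        return {"statusCode": 503, "body": "inventory temporarily unavailable"}
```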

Ensuring your AI systems can scale to meet demand

The scale of traffic handled by AI systems is hard to overstate. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources jumped by 1,200% since July 2024. While demand for AI-powered workloads is steadily increasing overall, traffic to individual AI providers is much more unpredictable. User demand spikes and wanes unexpectedly, but as with any service, users expect yours to be available and responsive at all times.
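One simple pattern for surviving unpredictable spikes is to shed excess load explicitly rather than let requests pile up behind the model. A minimal sketch, where `run_inference` is a hypothetical stand-in for your actual model call:

```python
# Sketch: cap concurrent inference requests and reject the overflow
# quickly, so spikes degrade service predictably instead of taking it
# down. `run_inference` is a hypothetical stand-in for the real model call.
import threading

MAX_CONCURRENT = 32
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def run_inference(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder for the real model call

def handle_request(prompt: str) -> tuple[int, str]:
    if not _slots.acquire(blocking=False):
        # Fast rejection: the client gets an honest 429 instead of a timeout.
        return 429, "over capacity, retry with backoff"
    try:
        return 200, run_inference(prompt)
    finally:
        _slots.release()
```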

How to keep track of what's running in your Gremlin team

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Reliability testing is ongoing, and tracking that work can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on—unless you use Gremlin.
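The same activity is also reachable programmatically through Gremlin's REST API. The sketch below is illustrative only: the endpoint path, auth scheme, and response fields are assumptions to verify against Gremlin's current API documentation.

```python
# Illustrative sketch only: list recent attacks via Gremlin's REST API.
# The endpoint path, the "Key" auth scheme, and the response fields are
# assumptions; confirm all of them against Gremlin's API docs.
import os
import requests

API_KEY = os.environ["GREMLIN_API_KEY"]

resp = requests.get(
    "https://api.gremlin.com/v1/attacks",          # assumed endpoint
    headers={"Authorization": f"Key {API_KEY}"},   # assumed auth scheme
    timeout=10,
)
resp.raise_for_status()
for attack in resp.json():
    print(attack.get("guid"), attack.get("stage"))  # assumed field names
```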

How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service of their payment applications could failover correctly between regions in case of an outage. But there was one snag: the service was built using AWS Lambda. This meant infrastructure-focused tests would have trouble replicating the failure conditions necessary to test the failover due to Lambda’s serverless model.
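Failure Flags solves this by injecting faults from inside the function's own code rather than its infrastructure. The sketch below shows the general shape of that approach; the package and class names follow Gremlin's Failure Flags SDK but should be verified against its docs, and the payment logic is a hypothetical stand-in.

```python
# Illustrative sketch: a Failure Flag inside a Lambda handler lets Gremlin
# inject faults (latency, errors) at this exact point, with no access to
# the underlying infrastructure required. SDK names per Gremlin's Failure
# Flags docs (verify there); charge_payment is a hypothetical stand-in.
from failureflags import FailureFlag

def handler(event, context):
    # If an experiment targets this flag, invoke() applies the configured
    # fault, e.g. raising an exception to exercise the regional failover.
    region = context.invoked_function_arn.split(":")[3]
    FailureFlag("charge-payment", {"region": region}).invoke()

    return charge_payment(event)

def charge_payment(event):
    return {"statusCode": 200, "body": "charged"}  # placeholder logic
```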

Simulating artificial intelligence service outages with Gremlin

The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button. As a result, more applications have been able to use AI services for data analysis, content generation, media production, and much more.
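That convenience creates a dependency worth testing: what happens to your application when the AI provider is slow or down? A minimal, hypothetical sketch of the fallback path such an outage simulation should exercise (the provider URL is deliberately unreachable; in practice you might inject latency or errors with a tool like Gremlin instead):

```python
# Sketch: the fallback path an AIaaS outage test should exercise. The
# outage is simulated here by pointing at an unreachable host; names and
# URLs are illustrative.
import requests

PRIMARY_URL = "https://api.example-ai.invalid/v1/generate"  # unreachable on purpose

def generate(prompt: str) -> str:
    try:
        resp = requests.post(PRIMARY_URL, json={"prompt": prompt}, timeout=2)
        resp.raise_for_status()
        return resp.json()["text"]
    except requests.RequestException:
        # Degraded mode: a canned answer beats an error page.
        return "Our AI assistant is temporarily unavailable. Please try again."

print(generate("summarize my order history"))
```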

Improving Resilience for GenAI Workloads on AWS

GenAI can do incredible things, but like any technology, its success depends on how we implement and use it. Without proper implementation, GenAI failures can pose significant risks to your organization's reputation and customer trust, leading to real financial impact. And like any other application, regulatory rules, SLAs, and reliability standards still apply to GenAI. With more companies integrating GenAI into their systems and products, it’s essential to make sure GenAI workloads and applications are highly available to deliver an exceptional user experience.
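One common availability pattern on AWS is regional failover for model invocations. Here's a minimal sketch using boto3 and Amazon Bedrock; the model ID and region list are examples, not recommendations.

```python
# Sketch: retry the same Bedrock model in a secondary region when the
# primary region fails. Assumes boto3 and Bedrock access in both regions;
# the model ID and regions are examples.
import json
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # example model
REGIONS = ["us-east-1", "us-west-2"]                 # primary, then fallback

def invoke_with_failover(body: dict) -> dict:
    last_err = None
    for region in REGIONS:
        try:
            client = boto3.client("bedrock-runtime", region_name=region)
            resp = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
            return json.loads(resp["body"].read())
        except (ClientError, EndpointConnectionError) as err:
            last_err = err  # primary region failed, try the next one
    raise last_err
```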

Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment. Fortunately, reliability testing can help you catch these errors before they cause outages.
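The "conflicting timeouts" slip is easy to picture, and just as easy to test for. Below is a deliberately misconfigured example with a trivial guardrail test that catches it; the values and names are illustrative.

```python
# Sketch: a "conflicting timeouts" mistake a coding agent (or a human)
# can leave behind, plus a trivial test that catches it. Values are
# illustrative; this test fails on purpose, flagging the bug.
GATEWAY_TIMEOUT_S = 5    # how long the gateway waits on the backend
BACKEND_TIMEOUT_S = 10   # how long the backend waits on its database

def test_timeouts_nest_correctly():
    # If the backend may legitimately take 10s, a 5s gateway timeout
    # guarantees spurious failures under load: each caller should wait
    # longer than everything beneath it.
    assert GATEWAY_TIMEOUT_S > BACKEND_TIMEOUT_S, (
        "gateway gives up before the backend can finish"
    )
```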

How to Build Observability into Chaos Engineering

If you've ever deployed a distributed system at scale, you know things break—often in ways you never expected. That’s where Chaos Engineering comes in. But running chaos experiments without robust observability is like debugging blindfolded. This guide will walk you through how observability empowers Chaos Engineering, ensuring that your experiments yield meaningful insights instead of just causing chaos for chaos’ sake.
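As a taste of what that pairing looks like in code, here is a minimal sketch that wraps an experiment window with explicit measurement, using `prometheus_client` as a stand-in for your metrics stack; the fault-injection and request functions are hypothetical placeholders.

```python
# Sketch: wrap a chaos experiment window with measurement so the
# experiment yields a verdict, not just chaos. Uses prometheus_client
# (pip install prometheus-client); inject_fault and the request body are
# hypothetical stand-ins for your fault injector and real traffic.
import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram("request_latency_seconds",
                            "End-to-end request latency")

def measured_request():
    with REQUEST_LATENCY.time():
        time.sleep(0.05)  # stand-in for a real request

def run_experiment_window(inject_fault, duration_s=30):
    start = time.time()
    inject_fault()                 # e.g., trigger a Gremlin attack
    while time.time() - start < duration_s:
        measured_request()         # keep observing while the fault runs
    # Afterward, compare the recorded distribution against your
    # steady-state SLO (in practice, by querying Prometheus/Grafana).
```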