
December 2023

How to troubleshoot unschedulable Pods in Kubernetes

Kubernetes is built to scale, and with managed Kubernetes services, you can deploy a Pod without having to worry about capacity planning at all. So why is it that Pods sometimes become stuck in an "Unschedulable" state? How do you end up with Pods that have been "Pending" for several minutes? In this post, we'll dig into the reasons Pods fail to schedule: why it happens, how to troubleshoot it, and how you can prevent it.

Kubernetes Reliability Risks: How to monitor for critical issues at scale

Learn how to automatically find and fix the most critical Kubernetes reliability risks in enterprise organizations. Recent research shows that nearly every organization has reliability risks in its Kubernetes clusters. Many of them are caused by simple misconfiguration, but they can have devastating consequences, including taking critical services offline. And while you could manually review every Kubernetes deployment, the speed and scale at which most organizations deploy to Kubernetes make that impractical.

How to fix Kubernetes init container errors

One of the most frustrating moments as a Kubernetes developer is when you launch a Pod and it fails to start because of a problem during initialization. Init containers are incredibly useful for setting up a Pod before handing it off to the main container, but they introduce an additional point of failure. In this post, we'll take an in-depth look at init containers in Kubernetes: what they are, how they work, how they can fail, and what that means for your Kubernetes deployments.

Release Roundup Dec 2023: Driving reliability standards (and much more)

2023 is coming to a close and the holiday season is here, but that doesn’t mean things at Gremlin are slowing down. In fact, we’ve released a ton of new features and improvements to make testing and improving reliability even easier. Now you can run Chaos Engineering experiments in serverless environments, create custom reliability test suites, create more flexible Scenarios, and more easily identify critical components in your environment.

Failure Flags helps build testable, reliable software without touching infrastructure

Building provably reliable systems means building testable systems. Testing for failure conditions is the only way to reliably root out issues before they impact customers. However, most current Chaos Engineering and resilience testing is focused on the underlying infrastructure. This helps identify potentially catastrophic failures, but misses the more frequent failures that still significantly impact customer experience.