Operations | Monitoring | ITSM | DevOps | Cloud

Gremlin

How to detect and prevent memory leaks in Kubernetes applications

In our last blog, we talked about the importance of setting memory requests when deploying applications to Kubernetes. We explained how memory requests lets you specify how much memory (RAM for short) Kubernetes should reserve for a pod before deploying it. However, this only helps your pod get deployed. What happens when your pod is running and gradually consumes more RAM over time?

Enterprise Chaos Engineering Certification Prep Session

Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.

Release Roundup Sept 2023: Measurably improve reliability

It’s been another busy few months here at Gremlin. Overall, our team has been working on feature improvements to enable teams to measurably improve the reliability of their systems, whether that’s through broadening platform support so you can run Gremlin in more places, making it easier than ever to identify reliability risks, or improving reporting so you can manage reliability programs effectively at enterprise scale. Here’s a summary of what’s new.

How to ensure your Kubernetes Pods have enough memory

Memory (or RAM, short for random-access memory) is a finite and critical computing resource. The amount of RAM in a system dictates the number and complexity of processes that can run on the system, and running out of RAM can cause significant problems, including: This problem can be mitigated using clustered platforms like Kubernetes, where you can add or remove RAM capacity by adding or removing nodes on-demand.

How a simple metric drives reliability culture at Slack

How do you track reliability in an organization with hundreds of engineers, dozens of daily production changes, and over 32 million monthly users? Even more, how do you do this in a way that's simple, presentable to executives, and doesn't dump a ton of extra work on to engineers' plates? Slack recently wrote about how they created the Service Delivery Index for Reliability (SDI-R), a simple yet comprehensive metric that became the basis for many of their reliability and performance indicators.

Deploying a multi-availability zone Kubernetes cluster for High Availability

Many cloud infrastructure providers make deploying services as easy as a few clicks. However, making those services high availability (HA) is a different story. What happens to your service if your cloud provider has an Availability Zone (AZ) outage? Will your application still work, and more importantly, can you prove it will still work? In this blog, we'll discuss AZ redundancy with a focus on Kubernetes clusters.

How to keep your Kubernetes Pods up and running with liveness probes

Getting your applications running on Kubernetes is one thing: keeping them up and running is another thing entirely. While the goal is to deploy applications that never fail, the reality is that applications often crash, terminate, or restart with little warning. Even before that point, applications can have less visible problems like memory leaks, network latency, and disconnections. To prevent applications from behaving unexpectedly, we need a way of continually monitoring them.

Automate reliability testing in your CI/CD pipeline using the Gremlin API

For many software engineering teams, most testing is done in their CI/CD pipeline. New deployments run through a gauntlet of unit tests, integration tests, and even performance tests to ensure quality. However, there's one key test type that's excluded from this list, and it's one that can have a critical impact on your application and your organization: reliability tests. As software changes, reliability risks get introduced.

How to ensure your Kubernetes Pods have enough CPU

Gremlin's Detected Risks feature immediately detects any high-priority reliability concerns in your environment. These can include misconfigurations, bad default values, or reliability anti-patterns. A common risk is deploying Pods without setting a CPU request. While it may seem like a low-impact, low-severity issue, not using CPU requests can have a big impact, including preventing your Pod from running.