Chaos Engineering

What is Gremlin?

Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with DevOps and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience before it’s too late. To close the reliability gap, we need a reliability strategy: one that’s proactive, measurable, built-in, and automated. We need a reliability platform.

Gremlin for DORA compliance: how financial services firms build digital resilience and prove it

The Digital Operational Resilience Act (DORA) is set to significantly impact the financial sector. Coming into full effect in 2025, this EU regulation will set new standards for information and communications technology (ICT) risk management. In this landscape, how can financial firms ensure they’re not only compliant, but also operationally resilient?

Ensuring consistent Kubernetes container versions

One of Kubernetes' killer features is its ability to seamlessly update applications no matter how large your deployment is. Did a developer make a code change, and now you need to update a thousand running containers? Just run kubectl apply -f manifest.yaml and watch as Kubernetes replaces each outdated pod with the new version.
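As a rough illustration of what a version-consistency check could look like, the sketch below groups pods by container image and flags the case where more than one image is in use. The function name `find_image_drift` and the sample pods are hypothetical; the dictionary layout is assumed to follow the `items` entries returned by `kubectl get pods -o json`.

```python
from collections import defaultdict


def find_image_drift(pods):
    """Map each container image to the names of the pods running it.

    `pods` is assumed to be a list of pod objects shaped like the entries
    in the `items` array of `kubectl get pods -o json`.
    """
    by_image = defaultdict(list)
    for pod in pods:
        for container in pod["spec"]["containers"]:
            by_image[container["image"]].append(pod["metadata"]["name"])
    return dict(by_image)


# Hypothetical sample data: one pod was left behind on the old version.
sample_pods = [
    {"metadata": {"name": "web-1"}, "spec": {"containers": [{"image": "web:2.0"}]}},
    {"metadata": {"name": "web-2"}, "spec": {"containers": [{"image": "web:2.0"}]}},
    {"metadata": {"name": "web-3"}, "spec": {"containers": [{"image": "web:1.9"}]}},
]

drift = find_image_drift(sample_pods)
if len(drift) > 1:
    # More than one distinct image means the rollout is not consistent.
    for image, names in sorted(drift.items()):
        print(f"{image}: {', '.join(names)}")
```

A single key in the result means every pod runs the same image; anything more is worth investigating.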

How to detect and prevent memory leaks in Kubernetes applications

In our last blog, we talked about the importance of setting memory requests when deploying applications to Kubernetes. We explained how memory requests let you specify how much memory (or RAM) Kubernetes should reserve for a pod before deploying it. However, this only helps your pod get deployed. What happens when your pod is running and gradually consumes more RAM over time?
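One simple way to spot gradual memory growth is to fit a straight line to periodic memory readings and look at the slope. The sketch below is a minimal example under stated assumptions: the function name, the sample readings, and the 1 MiB-per-sample threshold are all hypothetical, and the readings are assumed to be taken at equal intervals.

```python
def memory_growth_per_sample(samples_mib):
    """Return the least-squares slope of memory usage, in MiB per sample.

    Assumes `samples_mib` holds readings taken at equal time intervals.
    """
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mib) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mib))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical readings for one pod, in MiB.
readings = [100, 110, 120, 130]
slope = memory_growth_per_sample(readings)

# Hypothetical threshold: steady growth above 1 MiB per sample is suspicious.
if slope > 1.0:
    print(f"possible leak: memory growing by about {slope:.1f} MiB per sample")
```

A positive slope alone does not prove a leak, since caches and warm-up can also grow memory, so a trend like this is a prompt for closer inspection rather than a verdict.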

Enterprise Chaos Engineering Certification Prep Session

Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.

Release Roundup Sept 2023: Measurably improve reliability

It’s been another busy few months here at Gremlin. Overall, our team has been working on feature improvements to enable teams to measurably improve the reliability of their systems, whether that’s through broadening platform support so you can run Gremlin in more places, making it easier than ever to identify reliability risks, or improving reporting so you can manage reliability programs effectively at enterprise scale. Here’s a summary of what’s new.

How to ensure your Kubernetes Pods have enough memory

Memory (or RAM, short for random-access memory) is a finite and critical computing resource. The amount of RAM in a system dictates the number and complexity of processes that can run on the system, and running out of RAM can cause significant problems. This can be mitigated using clustered platforms like Kubernetes, where you can add or remove RAM capacity by adding or removing nodes on demand.

How a simple metric drives reliability culture at Slack

How do you track reliability in an organization with hundreds of engineers, dozens of daily production changes, and over 32 million monthly users? Even more, how do you do this in a way that's simple, presentable to executives, and doesn't dump a ton of extra work onto engineers' plates? Slack recently wrote about how they created the Service Delivery Index for Reliability (SDI-R), a simple yet comprehensive metric that became the basis for many of their reliability and performance indicators.

Deploying a multi-availability zone Kubernetes cluster for High Availability

Many cloud infrastructure providers make deploying services as easy as a few clicks. However, making those services highly available (HA) is a different story. What happens to your service if your cloud provider has an Availability Zone (AZ) outage? Will your application still work, and more importantly, can you prove it will still work? In this blog, we'll discuss AZ redundancy with a focus on Kubernetes clusters.
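A first sanity check for AZ redundancy is to count how many nodes sit in each zone. The sketch below assumes node objects shaped like the `items` entries from `kubectl get nodes -o json` and relies on the well-known `topology.kubernetes.io/zone` node label; the function names and sample nodes are hypothetical.

```python
from collections import Counter

# Well-known Kubernetes node label that records the node's zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(nodes):
    """Count nodes in each zone; nodes without the label count as 'unknown'."""
    return Counter(
        node["metadata"].get("labels", {}).get(ZONE_LABEL, "unknown")
        for node in nodes
    )


def spans_multiple_zones(nodes):
    """Return True if nodes are spread across at least two labeled zones."""
    zones = set(nodes_per_zone(nodes)) - {"unknown"}
    return len(zones) >= 2


# Hypothetical sample data with made-up node and zone names.
sample_nodes = [
    {"metadata": {"name": "node-a", "labels": {ZONE_LABEL: "zone-1"}}},
    {"metadata": {"name": "node-b", "labels": {ZONE_LABEL: "zone-1"}}},
    {"metadata": {"name": "node-c", "labels": {ZONE_LABEL: "zone-2"}}},
]

counts = nodes_per_zone(sample_nodes)
redundant = spans_multiple_zones(sample_nodes)
```

Spreading nodes across zones is necessary but not sufficient: the workloads themselves also need to be scheduled across those zones, and only a test of an actual zone failure shows whether the application keeps working.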