Gremlin

Introducing Custom Reliability Test Suites, Scoring and Dashboards

Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization. Today, we fulfill the next stage of that promise with the release of Custom Reliability Test Suites, Custom Scoring, and Dashboards.

Treat reliability risks like security vulnerabilities by scanning and testing for them

Finding, prioritizing, and mitigating security vulnerabilities is an essential part of running software. We all recognize that vulnerabilities exist and that new ones are introduced regularly, so we check for and remediate them on a regular basis. Even if the code passes every security check before being deployed, you still run regular security tests to make sure everything’s secure.

Building a Culture of Reliability: Why SREs Can't Do It Alone

Join Gremlin CTO and Founder Kolton Andrus to hear practical strategies for building a collaborative culture of reliability. High-velocity DevOps orgs and complex cloud-native architectures have made reliability harder than ever. Organizations are turning to SREs to make sure systems are reliable, but with so many stakeholders and competing priorities, many companies are still struggling to get ahead of the outages and incidents—SREs simply can't do it all by themselves.

How to fix and prevent ImagePullBackOff events in Kubernetes

You'll often hear the term "containers" used to refer to the entire landscape of self-contained software packages: this includes tools like Docker and Kubernetes, platforms like Amazon Elastic Container Service (ECS), and even the process of building these packages. But there's an even more important layer that often gets overlooked, and that's container images.

How to fix and prevent CrashLoopBackOff events in Kubernetes

It's one of the most dreaded words among Kubernetes users. Regardless of your software engineering skill or seniority level, chances are you've seen it at least once. There are a quarter of a million articles on the subject, and countless developer hours have been spent troubleshooting and fixing it. We're talking, of course, about CrashLoopBackOff.

What is Gremlin?

Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with DevOps and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience before it’s too late. To close the reliability gap, we need a reliability strategy: one that’s proactive, measurable, built-in, and automated. We need a reliability platform.

Gremlin for DORA compliance: how financial services firms build digital resilience and prove it

The Digital Operational Resilience Act (DORA) is set to significantly impact the financial sector. Coming into full effect in 2025, this EU regulation will set new standards for information and communications technology (ICT) risk management. In this landscape, how can financial firms ensure they’re not only compliant, but also operationally resilient?

Ensuring consistent Kubernetes container versions

One of Kubernetes' killer features is its ability to seamlessly update applications no matter how large your deployment is. Did a developer make a code change, and now you need to update a thousand running containers? Just run kubectl apply -f manifest.yaml and watch as Kubernetes replaces each outdated pod with the new version.

How to detect and prevent memory leaks in Kubernetes applications

In our last blog post, we talked about the importance of setting memory requests when deploying applications to Kubernetes. We explained how memory requests let you specify how much memory (RAM for short) Kubernetes should reserve for a pod before deploying it. However, this only helps your pod get deployed. What happens when your pod is running and gradually consumes more RAM over time?

Enterprise Chaos Engineering Certification Prep Session

Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.