
Reliability Resolutions: How to build effective reliability programs that won't fade away

Did you know the third week of January is when people most commonly give up on their New Year’s resolutions? Whether the goal is exercising more, learning a new language, or just drinking less coffee, the initial surge of New Year’s energy is fading by then, so this is the key time to turn a resolution into a lasting change. The same is true of any reliability resolutions you might have made.

Recommended Experiments for Production Resilience in Harness Chaos Engineering | Harness Blog

This guide covers battle-tested chaos experiments for Kubernetes, AWS, Azure, and GCP to help you validate production resilience before real failures happen. Start with low blast radius experiments (pod-level) and gradually progress to higher impact scenarios (node/zone failures), always defining clear hypotheses and using probes to measure results. Building reliable distributed systems isn't just about writing good code. It's about understanding how your systems behave when things go wrong.
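The progression described above (small blast radius first, each experiment paired with a hypothesis and a probe) can be sketched roughly as follows. This is a minimal illustration, not the Harness Chaos Engineering API: the `Experiment` class, the `BLAST_RADIUS` ranking, the `run_in_order` helper, and the stubbed probes are all hypothetical names, and the hypothesis strings are made-up examples.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical ranking: lower numbers affect less of the system.
BLAST_RADIUS = {"pod": 0, "node": 1, "zone": 2}


@dataclass
class Experiment:
    name: str
    scope: str                 # one of the BLAST_RADIUS keys
    hypothesis: str            # what we expect to remain true
    probe: Callable[[], bool]  # returns True if the hypothesis held


def run_in_order(experiments: List[Experiment]) -> List[str]:
    """Run experiments from smallest to largest blast radius.

    Stops at the first failed probe, so a larger experiment is never
    attempted while a smaller one still exposes a weakness.
    Returns the names of the experiments whose hypothesis held.
    """
    passed = []
    for exp in sorted(experiments, key=lambda e: BLAST_RADIUS[e.scope]):
        if not exp.probe():
            break
        passed.append(exp.name)
    return passed


# Stubbed probes stand in for real measurements (illustrative only).
plan = [
    Experiment("zone-outage", "zone", "traffic fails over to another zone", lambda: True),
    Experiment("pod-delete", "pod", "a replacement pod becomes ready", lambda: True),
    Experiment("node-drain", "node", "workloads reschedule without errors", lambda: False),
]
result = run_in_order(plan)
```

In this stubbed plan the pod-level experiment passes and the node-level probe fails, so the zone-level experiment is never run. In practice each probe would query a health endpoint or a metric rather than return a constant.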

Chaos Engineering Training: Zonal, Regional Failures and SSL/TLS Certificates Expiration

Learn how to test your system's resilience against critical infrastructure failures. This tutorial demonstrates how to simulate zonal and regional outages to validate your high availability setup, and how to test SSL/TLS certificate expiration scenarios. Both are essential for ensuring your applications can handle real-world failure conditions and maintain uptime during certificate-related issues.
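As a small illustration of the certificate side, the sketch below computes how many days remain before a certificate's `notAfter` timestamp, using Python's standard `ssl.cert_time_to_seconds`. It is not taken from the tutorial: the helper names and the 30-day threshold are assumptions, and passing a future `now` is just one simple way to rehearse an expiration scenario without touching a real certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def is_expiring(not_after: str, now: datetime, threshold_days: float = 30) -> bool:
    """True if the certificate expires within `threshold_days` (assumed default)."""
    return days_until_expiry(not_after, now) <= threshold_days


# Simulate the clock at a chosen moment instead of waiting for real expiry.
not_after = "Jan  5 09:34:43 2018 GMT"
simulated_now = datetime(2018, 1, 1, 9, 34, 43, tzinfo=timezone.utc)
remaining = days_until_expiry(not_after, simulated_now)
```

A negative result means the certificate has already expired at the simulated time, which is the condition an expiration test would want the application to handle gracefully.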