Operations | Monitoring | ITSM | DevOps | Cloud

SRE's Guide to Chaos & Observability

Today’s distributed, cloud-based environments are incredibly complex. Not only does each component depend on many others, but modern systems are also highly dynamic—changing frequently as teams push new code or make updates to infrastructure. Taming this complexity to ensure reliability requires end-to-end observability to understand how components depend on each other. Additionally, proactive Chaos Engineering combined with AI-driven observability lets you uncover “unknown unknowns” that impact how your system will respond to different failure scenarios.

Building Reliable Applications Webinar 6 17 21

Test-driven development (TDD) is a process that ensures quality in the applications we develop while guarding against feature creep/skew. But as our applications have become increasingly complex, traditional testing methods are not enough. Traditional testing only evaluates what we know, but complex systems often fail due to unknowns—the things that are almost impossible to test because we are unaware of them. Chaos Engineering is the exception that allows us to test for what we don’t know.

HUG Relies on PagerDuty When Healthcare Incidents Arise

The Geneva University Hospital (HUG) is one of the five university hospitals in Switzerland and one of the largest hospitals in Europe. Pierryves Fournier, SRE Team Lead at HUG, explains how PagerDuty and Rundeck help automate his team's incident response process, empowering the right action when seconds matter.

Dissecting DevOps - Measuring quality in a SaaS world: SLA, SLI, SLO

Now that software is delivered over the web and not in a box, how developers guarantee quality to their users has radically changed. Users do not care about version numbers or floppy disks. They just want access to a service that just works. In the microservices world, the quality of your service both to your internal users and external is measured by SLAs, SLIs, and SLOs. And how you decide what those metrics are is a key strategy.

The Ops Agent is now GA and it leverages OpenTelemetry

Running and troubleshooting production services requires deep visibility into your applications and infrastructure. While basic logs and metrics are available out of the box with Google Cloud Compute Engine (GCE), capturing advanced data used to require the installation of both a metrics agent and a logging agent.