Operations | Monitoring | ITSM | DevOps | Cloud

Google Operations

SLOs with Stackdriver Service Monitoring

Service Level Objectives (SLOs) are one of the fundamental principles of site reliability engineering. We use them to precisely quantify the reliability target we want to achieve in our service. We also use their inverse, error budgets, to make informed decisions about how much risk we can take on at any given time. This lets us determine, for example, whether we can go ahead with a push to production or an infrastructure upgrade.
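To make the relationship between an SLO and its error budget concrete, here is a minimal Python sketch of the arithmetic. It assumes a request-based SLO (good requests divided by total requests over a window). The function names and the 10% release threshold are illustrative assumptions, not part of Stackdriver or any Google API.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Return how many events may fail in the window without breaching the SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target leaves no budget at all: any failure overspends it.
        return 1.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / budget


def can_release(slo_target: float, total_events: int, bad_events: int,
                min_remaining: float = 0.1) -> bool:
    """Illustrative policy: allow a risky change only if enough budget is left.

    The 10% default threshold is an assumption chosen for this example.
    """
    return budget_remaining(slo_target, total_events, bad_events) > min_remaining
```

With a 99.9% target and one million requests in the window, the budget is about 1,000 failed requests. A team that has already seen 250 failures still holds roughly three quarters of its budget, so under this illustrative policy a production push could go ahead.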

How to use Stackdriver monitoring export for long-term metric analysis

Our Stackdriver Monitoring tool works on Google Cloud Platform (GCP), Amazon Web Services (AWS), and even on-prem apps and services with partner tools like Blue Medora’s BindPlane. Monitoring keeps metrics for six weeks because the operational value of monitoring metrics is usually highest within a recent time window. For example, knowing the 99th percentile latency for your app may be useful to your DevOps team in the short term as they monitor applications on a day-to-day basis.
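For readers unfamiliar with the 99th percentile latency mentioned above, the following Python sketch shows one common way to compute it from raw samples, the nearest-rank method. This only illustrates the statistic; it is an assumption that it approximates what a monitoring backend reports, since such systems typically estimate percentiles from histogram buckets rather than raw samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Calling `percentile(latencies_ms, 99)` on a list of request latencies (a hypothetical variable name) returns the value that 99% of requests met or beat.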

Stackdriver Trace - Stack Doctor

Welcome to another episode of Stack Doctor. In the last episode, we worked with Stackdriver to set up SLI monitoring for application latency. In this episode, Customer Engineer Specialist Yuri Grinshteyn demonstrates what happens to applications with latency issues and how to diagnose them and restore your service to health!

Monitoring Kubernetes Clusters on GKE (Google Kubernetes Engine)

The Kubernetes ecosystem contains a number of logging and monitoring solutions. These tools address monitoring and logging at different layers in the Kubernetes Engine stack. This document describes some of these tools and the layer of the stack each one addresses, along with best practices for implementation, including an example from the field, a quick start, and a demo project.