Google Operations

Building a more reliable infrastructure with new Stackdriver tools and partners

Every software organization faces challenges in keeping applications available and running reliably. At Google, we’ve developed and practiced a discipline known as Site Reliability Engineering (SRE). Following SRE practices lets us build and operate services reliably for our billions of users. Google has about 2,500 Site Reliability Engineers who support both internal and external services.

Postmortems and Retrospectives (class SRE implements DevOps)

Even after a service has been restored, SREs still have work to do. In this video, Liz and Seth discuss the postmortem process that SREs follow. Blameless postmortems and retrospectives are key to learning from failures and preventing recurrence. You will learn why postmortems matter, strategies for keeping them blameless, and techniques for tracking retrospective trends across your entire organization to gain insights that help prevent future service disruptions.

Disruption Detector and Real Time Monitoring with Stackdriver (Cloud Next '18)

Aja built an interactive disruption detector panel that let attendees at the Google I/O Conference intentionally cause errors in the system. This demo highlights Stackdriver's real-time monitoring as it tracks all incoming errors and makes it easier for developers to pinpoint the issue. Watch the video to learn more.

Incident Management (class SRE implements DevOps)

In the previous video, Liz and Seth discussed how to make systems observable and how observability helps us diagnose failing systems, but they didn't cover what to do when an incident grows beyond what one person can handle. In this video, you learn about the most important part of the incident management process: humans.

Cloud OnAir: CE TV: Application Observability with LightStep

Observability remains a key challenge as customers embrace DevOps. Join Daniel "Spoons" Spoonhower, the CTO and Founder of Lightstep, a Google Cloud customer, and Yuri Grinshteyn, a Google Cloud Customer Engineer, to learn how Lightstep was built on Google Cloud to enable you to monitor what matters most and diagnose anomalies within seconds across web, mobile, monoliths, and microservices.

Using Stackdriver Workspaces to help manage your hybrid and multicloud environment

At Google, we believe strongly in an open cloud. We’re continually working to bring you tools for understanding how your applications are performing, whether they run in different projects, organizations, clouds, or even on-premises. Monitoring tools like Stackdriver Kubernetes Monitoring, OpenCensus, and Stackdriver APM are designed to give you visibility into your workloads wherever they run: on Google Cloud Platform (GCP), on-premises, or on another cloud platform.