September 2021

Monitoring compute infrastructure with the Cloud Ops Agent

Sep 22, 2021 By Google Operations In Google Operations

How can you improve observability for workloads that run directly on compute infrastructure, such as Google Compute Engine instances? In this episode of Engineering for Reliability, we show how you can use the Cloud Operations Agent to do just that. Watch to learn about the Cloud Operations Agent, how to install it manually and automatically, and how to use the data it collects to improve the reliability of your services and keep your users happy!


Maintaining reliable services with advanced Cloud Logging features

Sep 8, 2021 By Google Operations In Google Operations

We’ve covered ingesting, routing, storing, and viewing logs from your services in Cloud Logging already, but what else can you do with all that data? In this episode of Engineering for Reliability, we show how you can use advanced features like alerting on logs, logs-based metrics, and capturing application exceptions in Error Reporting. Watch to learn how you can find issues faster, make your services more reliable, and keep your users happy.


How Lowe's SRE reduced its mean time to recovery (MTTR) by over 80 percent

Sep 7, 2021 By Shyam Palani In Google Operations

The stakes of managing Lowes.com have never been higher, and that means spotting, troubleshooting and recovering from incidents as quickly as possible, so that customers can continue to do business on our site. To do that, it’s crucial to have solid incident engineering practices in place. Resolving an incident means mitigating the impact and/or restoring the service to its previous condition.

