Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Troubleshooting services on Google Kubernetes Engine by example

Applications fail. Containers crash. It’s a fact of life that SRE and DevOps teams know all too well. To help navigate life’s hiccups, we’ve previously shared how to debug applications running on Google Kubernetes Engine (GKE). We’ve also updated the GKE dashboard with new, easier-to-use troubleshooting flows. Today, we go one step further and show you how to use these flows to quickly find and resolve issues in your applications and infrastructure.

With SRE, failing to plan is planning to fail

People sometimes think that implementing Site Reliability Engineering (or DevOps for that matter) will magically make everything better. Just sprinkle a little bit of SRE fairy dust on your organization and your services will be more reliable and more profitable, and your IT, product and engineering teams will be happy. It’s easy to see why people think this way. Some of the world’s most reliable and scalable services run with the help of an SRE team, Google being the prime example.

To the cloud and beyond! Planning a multi-year data center migration

A data center migration into the cloud is often a daunting business initiative that can take years as you transition your existing hardware, software, networking, and operations into a brand new environment. In our roles with Google Cloud’s Professional Services organization, we work side by side with customers to collaboratively architect and enable data center migrations into Google Cloud. Over the years, we’ve participated in multiple migration journeys, and devised a general approach.

Three ways tight integration makes logging and monitoring easier

Driving the productivity of software development and delivery teams is critical for any organization. Six years of research by DevOps Research and Assessment (DORA) showcase the role easy-to-use tooling plays in driving this productivity and, in turn, a better work/life balance for the team. The research finds that the highest-performing teams are 1.5x more likely to have tools they consider easy to use.

Avoid cost overruns: How to manage your quotas programmatically

One important aspect of managing a cloud environment is setting up financial governance to safeguard against budget overruns. Fortunately, Google Cloud lets you set quotas for a variety of services, which can play a key role in establishing guardrails and protecting against unforeseen cost spikes. And to help you set and manage quotas programmatically, we’re pleased to announce that the Service Usage API now supports quota limits in Preview.

Take the first step toward SRE with Cloud Operations Sandbox

At Google Cloud, we strive to bring Site Reliability Engineering (SRE) culture to our customers not only through training on organizational best practices, but also with the tools you need to run successful cloud services. Part and parcel of that is comprehensive observability tooling—logging, monitoring, tracing, profiling and debugging—which can help you troubleshoot production issues faster, increase release velocity and improve service reliability.

Cloud Profiler provides app performance insights, without the overhead

Do you have an application that’s a little… sluggish? Cloud Profiler, Google Cloud’s continuous application profiling tool, can quickly find poorly performing code that slows your app and drives up your compute bill. In fact, by helping you find the source of memory leaks and other errors, Profiler has helped some of Google Cloud’s largest accounts reduce their CPU consumption by double-digit percentage points.

How Cloud Operations helps users of Wix's Velo development platform provide a better customer experience

As more and more businesses move online and homegrown entrepreneurs spin up new online apps, they increasingly look for an online development platform that helps them easily build and deploy their sites.

Find logs fast with new "tail -f" functionality in Cloud Logging

When you’re troubleshooting an app or a deployment, every second counts! Cloud Logging helps you troubleshoot by aggregating logs from across Google Cloud, on-premises environments or other clouds, indexing them, aggregating logs into metrics, scanning for unique errors with Error Reporting and making logs available for search, all in less than a minute. And now, we’ve built two new features for streaming logs to give you even fresher insights from your logs data.

Introducing Monitoring Query Language, now GA in Cloud Monitoring

Developers and operators on IT and development teams want powerful metric querying, analysis, charting, and alerting capabilities to troubleshoot outages, perform root cause analysis, create custom SLIs and SLOs, build reports and analytics, set up complex alert logic, and more. So today we’re excited to announce the General Availability of Monitoring Query Language (MQL) in Cloud Monitoring! MQL represents a decade of learnings and improvements on Google’s internal metric query language.