Inevitably, in the lifetime of a service or application, developers, DevOps, and SREs will need to investigate the cause of latency. Usually you will start by determining whether it is the application or the underlying infrastructure causing the latency. You have to look for signals that indicate the performance of those resources when the issue occured.
When you are experiencing an issue with your application or service, having deep visibility into both the infrastructure and the software powering your apps and services is critical. Most monitoring services provide insights at the Virtual Machine (VM) level, but few go further. To get a full picture of the state of your application or service, you need to know what processes are running on your infrastructure.
Keeping the experience of your end user in mind is important when developing applications. Observability tools help your team measure important performance indicators that are important to your users, like uptime. It’s generally a good practice to measure your service internally via metrics and logs which can give you indications of uptime, but an external signal is very useful as well, wherever feasible.
Troubleshooting production issues with virtual machines (VMs) can be complex and often requires correlating multiple data points and signals across infrastructure and application metrics, as well as raw logs. When your end users are experiencing latency, downtime, or errors, switching between different tools and UIs to perform a root cause analysis can slow your developers down.
When you’re troubleshooting an application on Google Kubernetes Engine (GKE), the more context that you have on the issue, the faster you can resolve it. For example, did the pod exceed it’s memory allocation? Was there a permissions error reserving the storage volume? Did a rogue regex in the app pin the CPU? All of these questions require developers and operators to build a lot of troubleshooting context.
Logs are an essential part of troubleshooting applications and services. However, ensuring your developers, DevOps, ITOps, and SRE teams have access to the logs they need, while accounting for operational tasks such as scaling up, access control, updates, and keeping your data compliant, can be challenging. To help you offload these operational tasks associated with running your own logging stack, we offer Cloud Logging.
Visualizing trends in your logs is critical when troubleshooting an issue with your application. Using the histogram in Logs Explorer, you can quickly visualize log volumes over time to help spot anomalies, detect when errors started and see a breakdown of log volumes. But static visualizations are not as helpful as having more options for customization during your investigations.