The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Monitor kube-state-metrics v2.0 with Datadog

To manage complex containerized applications, modern DevOps teams need deep visibility into the status of their Kubernetes resources. By listening directly to the Kubernetes API, the open source kube-state-metrics service generates key metrics about your Kubernetes objects, including pods, nodes, and deployments, which are essential for understanding the status and performance of your clusters.
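To give a feel for what those object-level metrics look like, here is a minimal sketch that counts pods per phase from Prometheus-style text such as kube-state-metrics exposes. The sample payload, pod names, and the `pods_by_phase` helper are made up for illustration, and the regex parser is a simplification rather than a full exposition-format parser.

```python
import re
from collections import Counter

# Invented sample in the Prometheus text format; real output has many more series.
SAMPLE = """\
# HELP kube_pod_status_phase The pods current phase.
# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="default",pod="web-1",phase="Running"} 1
kube_pod_status_phase{namespace="default",pod="web-1",phase="Pending"} 0
kube_pod_status_phase{namespace="default",pod="job-7",phase="Pending"} 1
kube_pod_status_phase{namespace="default",pod="job-7",phase="Running"} 0
"""

# Simplified patterns: assume no escaped quotes or braces inside label values.
LINE = re.compile(r'^kube_pod_status_phase\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def pods_by_phase(text: str) -> Counter:
    """Count pods per phase, using only series whose gauge value is 1."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE.match(line)
        if not match or float(match.group("value")) != 1:
            continue  # skip comments, other metrics, and inactive phases
        labels = dict(LABEL.findall(match.group("labels")))
        counts[labels["phase"]] += 1
    return counts


print(dict(pods_by_phase(SAMPLE)))
```

In practice a monitoring agent scrapes and aggregates these series for you; the sketch only shows the shape of the data.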

Top SRE Toolchain Used By Site Reliability Engineers

We have compiled a list of the most popular and sought-after tools (some you may have heard of) that SREs need in their toolkit. Site reliability engineering (SRE) practices help organizations ensure that their deliverables run smoothly, with high reliability and resilience. This is achieved with a set of well-defined tools deployed at every phase of the production system to keep up with SRE best practices.

SRE fundamentals 2021: SLIs vs. SLAs vs. SLOs

A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud. The end goal of our SRE principles is to improve services and in turn the user experience. The concept of SRE starts with the idea that metrics should be closely tied to business objectives. In addition to business-level SLAs, we also use SLOs and SLIs in SRE planning and practice.
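As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and checks it against a target. The request numbers, the 99.9% objective, and both function names are assumptions chosen for the example, not figures from the article.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO tolerates
    actual_failure = 1.0 - sli          # failure rate actually observed
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers for one measurement window.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
slo_target = 0.999  # assumed objective: 99.9% of requests succeed

print(f"SLI: {sli:.4%}")
print(f"SLO met: {sli >= slo_target}")
print(f"Error budget remaining: {error_budget_remaining(sli, slo_target):.0%}")
```

The SLI is what you measure, the SLO is the internal target you hold it to, and an SLA would attach business consequences to missing a (usually looser) target.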

Accelerating Code Quality with DORA Metrics

What do Google’s DevOps Research and Assessment (DORA) and Rollbar have to do with each other? DORA identified four key metrics to measure DevOps performance and defined four performance levels, from Low to Elite. One way for a team to become an Elite DevOps performer is by focusing on Continuous Code Improvement.

Diagnosing Database Performance Problems When You Aren't a Database Administrator

Deep specialization of IT administrators is a luxury only the largest organizations can typically afford. Smaller organizations rely on IT administrators with a more generalist skill set because they are—by necessity—responsible for a wide array of different technologies, and there simply isn’t time to specialize in the intricacies of any one of them. Yet modern IT is intricate.

Failover Conf 2021 Wrap-Up

That’s a wrap! Gremlin hosted Failover Conf 2: Fail Smarter on April 27, 2021. In attendance were over 500 SREs, developers, sales engineers, product managers, DevOps experts, C-level execs, and other reliability pros from around the globe! This year’s conference included discussions around the future of DevOps, strategies for building reliable teams, analyzing human error to create better systems, and more.

Cloud-Hosted or Cloud-Native? Discover Why Cloudsmith Was Born in the Cloud

Today, almost every service is offered in a “Cloud” variant. But what does that really mean? Are all cloud services equal? It’s easy to see why so many vendors rush to add a Cloud edition/variant of established software they sell. Undoubtedly, there has been a move to Cloud services across the industry, as more and more organizations seek to take advantage of the higher reliability and lower total cost of ownership that Cloud platforms promise.

ICYMI: How Honeycomb Can Help You Achieve the Deployment Part of CI/CD

In case you missed it, this webinar includes code walkthroughs that help you to add observability to your pipelines (using a free Honeycomb account!) so that you and your team can speed up your deployments to prod. This is also a risk-free way to get started with observability if your team isn’t quite yet ready to change your production apps.