
SRE

The latest news and information on Site Reliability Engineering and related technologies.

Why 'owning Services' is critical for effective Incident Response

There is a famous quote that goes like this: ‘For every minute spent organizing, an hour is earned.’ In the world of incident response, at least, nothing is more apt. Digital infrastructure these days is made up of multiple services, and an outage could result from one impacted service or from several. So it's essential to have a catalog of all the services, along with the point of contact (service owner) responsible for maintaining each one.

On Building a Platform Team

It may surprise you to hear, but Honeycomb doesn’t currently have a platform team. We have a platform org, and my title is Director of Platform Engineering. We have engineers doing platform work. And we even have an SRE team and a core services team. But a platform team? Nope. I’ve been thinking about what it might mean to build a platform team up from scratch (a situation some of you may also be in), and it led me to ask some crucial questions. What should such a team own?

Routing alerts from AWS Elastic Beanstalk via CloudWatch

Amazon Web Services (AWS) offers 100+ services, each focusing on a specific area of functionality. However, it can be challenging to pick the right services for the task and also to provision them. AWS Elastic Beanstalk lets you easily deploy and manage applications without needing to learn about the underlying infrastructure that runs them.

Introduction to Automation Testing Strategies For Microservices

Microservices are distributed applications that are deployed in different environments and may be developed in different programming languages, backed by different databases, and connected by many internal and external communications. A microservice architecture depends on multiple interdependent applications for its end-to-end functionality. This complexity calls for a systematic testing strategy that ensures end-to-end (E2E) coverage for any given use case. In this blog, we will discuss some of the most widely adopted automation testing strategies for microservices, using the testing triangle approach.

Authors' Cut: Gear up! Exploring the Broader Observability Ecosystem of Cloud-Native, DevOps, and SRE

You know that old adage about not seeing the forest for the trees? In our Authors’ Cut series, we’ve been looking at the trees that make up the observability forest—among them, CI/CD pipelines, Service Level Objectives, and the Core Analysis Loop. Today, I'd like to step back and take a look at how observability fits into the broader technical and cultural shifts in technology: cloud-native, DevOps, and SRE.

SRE Fundamentals: Everything you need to know

Google has had an outsized impact on the world, from its unrivaled search engine to its expansion into a range of customer-focused services. It would be difficult to make an impact of this magnitude without also leading the way in the software development industry. One of its biggest contributions to the community is a set of principles known as site reliability engineering, or SRE.