Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Empower the SREs - Conclusions from The SRE Report 2023

Let's be honest, nobody loves surveys. Ok, well I sure don't. But surveys satisfy a huge need in our demand for insights into complex human-computer, sociotechnical systems. It turns out that we've been measuring the computer part pretty well, but the humans – not as easy to keep track of. When Google SRE first defined toil as a metric we wanted to reduce, we spent far too long trying to quantify it numerically based on tooling and insights from computer systems.

Introducing a more complete logs forwarding experience

One of the key attributes of DevOps and SRE engineers is their ability to meticulously observe and monitor all of their applications. A task which can be achieved more efficiently by centralizing all generated logs to a central endpoint. By centralizing logging, engineers can, at any time, have an accurate overview of all events which take place across their applications, from just one place. Storing logs in an external system also allows companies to ensure compliance with many certifications.

For incident management, should you build or buy?

Is your incident response held together by a thread? Are you manually recording incident updates in a shared doc? Do you struggle to juggle the incident management workload with your other responsibilities? Does everyone on-call report data the same way? These are all common problems faced by DevOps teams still relying on homegrown incident management tooling.

Service Level Management Process Explained (with Examples)

‍ Service Level Management, or SLM, is defined as the process of negotiating Service Level Agreements and ensuring that they are met. ‍ Service Level Management is a fundamental part of SRE and DevOps. It encompasses the expectations and perceptions that both the business and the customer have about the service and its performance. Service level management will include existing and new services as they are added, with the service level agreements (SLAs) being modified accordingly.

Why 'owning Services' is critical for effective Incident Response

There is a famous quote that goes like this…‘For every minute spent organizing, an hour is earned.’ At least in the world of incident response, nothing is more apt than this. Digital infrastructure these days is made up of multiple services, an outage could result from either one impacted service or multiple impacted services. So it's essential to have a catalog of all the services along with the point of contact (service owner) responsible for maintaining it.

On Building a Platform Team

It may surprise you to hear, but Honeycomb doesn’t currently have a platform team. We have a platform org, and my title is Director of Platform Engineering. We have engineers doing platform work. And, we even have an SRE team and a core services team. But a platform team? Nope. I’ve been thinking about what it might mean to build a platform team up from scratch—a situation some of you may also be in—and it led me to asking crucial questions. What should such a team own?

Routing alerts from AWS Elastic Beanstalk via CloudWatch

Amazon Web Services (AWS) offers 100+ services, each focusing on a specific area of functionality. However, it can be challenging to pick the right services for the task and also to provision them. AWS Elastic Beanstalk, lets you easily deploy and manage applications without the need to learn about the underlying infrastructure that runs these applications.