The latest news and information on Site Reliability Engineering and related technologies.

Show character with Blameless Postmortems (part one)

Apr 4, 2022 By Dave Harrison In Raygun

This is Part 1 of a two-part series on Blameless Postmortems. Today, we'll discuss why blameless postmortems are so important and their implications for your team; the second part will go into detail on how to set them up as a process and make them successful. Somebody wise may have once told you that how we handle adversity shows our character. Being able to acknowledge and admit mistakes is the first step towards learning - it's a key part of success both in personal relationships and in large companies.

New StackPod Episode: Implementing an SRE Practice with Yousef Sedky of Axiom/Hyke

Mar 31, 2022 By Annerieke Kortier In StackState

For our latest StackPod episode, we invited Yousef Sedky, Hyke’s DevOps team lead and AWS Cloud architect. Axiom Telecom is one of the largest telephone retailers in the United Arab Emirates and Saudi Arabia, and Hyke, its sister company, is a distribution platform for mobile products.

SRE vs. Platform Engineering: The Key Differences, Explained

Mar 29, 2022 By JP Cheung In Rootly

Site Reliability Engineering (SRE) teams and Platform Engineering teams share similar goals, like maximizing automation and reducing toil, and similar methodologies. But they have different priorities, and use somewhat different tools to achieve them. What are SREs, what are platform engineers, and how are the two roles similar and different? This article explains.

How important is Observability for SRE?

Mar 27, 2022 By Ricardo Castro In Squadcast

Observability is what defines a strong SRE team. In this blog, we have covered the importance of observability, and how SREs can leverage it to enhance their business. Observability is the practice of assessing a system's internal state by observing its external outputs. Through instrumentation, systems can provide telemetry such as metrics, traces, and logs that help organizations better understand, debug, maintain and evolve their platforms.

Rundeck + Squadcast Integration: Simplifying Alert Routing

Mar 25, 2022 By Vishal Padghan In Squadcast

Rundeck is an automation tool that helps make existing automation, scripts, and commands more secure, auditable, and easier to run. It is a software job scheduler and runbook automation system that automates routine processes across development and production environments. It brings together task scheduling, multi-node command execution, and workflow orchestration. It also logs everything that happens in the system. Squadcast is an end-to-end incident response tool.

SolarWinds Orion + Squadcast: Alert Routing Made Easy

Mar 24, 2022 By Vishal Padghan In Squadcast

SolarWinds Orion is a scalable infrastructure monitoring and management platform. It is designed to simplify IT administration for on-premises, hybrid, and software as a service (SaaS) environments, in a single pane of glass. SolarWinds Orion ensures you do not have to struggle with numerous incompatible point monitoring products, as it consolidates the full suite of monitoring capabilities into one platform with cross-stack integrated functionality. Squadcast is an end-to-end incident response tool.

What Is Site Reliability Engineering (SRE)? The SRE Role Explained

Mar 22, 2022 By Joey D'Antoni In SolarWinds

Historically, there was a clear delineation between what system administrators (SysAdmins) do and what application developers are responsible for in IT organizations. In recent years—especially in organizations focused on software development—these worlds have come together as IT operations and development teams adopt DevOps practices. The concept of site reliability engineering (SRE) was first introduced by a much-discussed book titled Site Reliability Engineering from Google.

SRE Revisited: SLO in the age of Microservices

Mar 18, 2022 By Dotan Horovits In logz.io

Site Reliability Engineering (SRE) practice was established by Google nearly 20 years ago, and was popularized with Google’s monumental SRE Book. Everyone’s been attempting to follow that iconic path ever since.

Honeycomb + Squadcast Integration: Routing Incident Alerts Made Easy

Mar 18, 2022 By Vishal Padghan In Squadcast

Honeycomb is an application monitoring tool that helps DevOps and SRE teams operate more efficiently by offering rich observability solutions and intuitive team collaboration. It helps you understand complex relationships within your distributed systems and troubleshoot issues accordingly. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all incident response activities.

SRE Metrics: Four Golden Signals of Monitoring

Mar 18, 2022 By Stephen Watts In Splunk

SRE (site reliability engineering) is a discipline used by software engineering and IT teams to proactively build and maintain more reliable services. SRE is a functional way to apply software development solutions to IT operations problems. From IT monitoring to software delivery to incident response, site reliability engineers are focused on building and monitoring anything in production that improves service resiliency without harming development speed.
