The latest news and information on Site Reliability Engineering and related technologies.

Why SREs Need to Embrace Chaos Engineering

Jul 20, 2022 By xMatters In xMatters

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

Top 12 Site Reliability Engineering (SRE) Tools

Jul 20, 2022 By Eyal Katz In Lightrun

Ben Treynor Sloss, then VP of Engineering at Google, coined the term “Site Reliability Engineering” in 2003. Site Reliability Engineering, or SRE, aims to build and run scalable and highly available systems. The philosophy behind Site Reliability Engineering is that developers should treat errors as opportunities to learn and improve. SRE teams constantly experiment and try new things to enhance their support systems.

Monitoring Your Platform From Multiple Locations

Jul 14, 2022 By Andrei Danilov In Rootly

Mature start-ups and scale-ups create wonderful and challenging environments for engineers. As the product they’re creating matures and the brand becomes successful, the user base generally starts growing and, for some companies, in places they might not expect. As that happens, new challenges arise for engineers. One of these challenges is straightforward to guess: keeping the product available across different regions of the world.

Amazon OpenSearch + Squadcast Integration: Routing Alerts Made Easy

Jul 12, 2022 By Vishal Padghan In Squadcast

Developers often find comfort in embracing open-source software for numerous reasons. One of the most important is the freedom to use that software wherever and however they wish. Amazon OpenSearch is an open-source search and analytics suite derived from Elasticsearch. It lets you perform interactive log analytics and real-time application monitoring with ease.

Top Five Pitfalls of On-Call Scheduling

Jun 30, 2022 By Squadcast Community In Squadcast

On-call schedules ensure that there's someone available day and night to fix or escalate any issues that arise. Using an on-call schedule helps keep things running smoothly. These on-call workers can be anyone from nurses and doctors required to respond to emergencies to IT and software engineering staff who need to fix service outages or significant bugs. Being on-call can be challenging and stressful. But with the proper practices in place, on-call schedules can fit well into an employee's work-life balance while still meeting the organization's needs.

Why More Incidents Are Better

Jun 30, 2022 By Andre King In Rootly

Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be “zero.” After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing. Reducing actual incidents by as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number one enemy.

Are you doing SRE wrong? 4 questions to ask

Jun 29, 2022 By Auri Poso In Aiven

SRE requires teamwork and planning. Be like Aiven and get it right.

Distributed Caching on Cloud

Jun 27, 2022 By Rajiv Srivastava In Squadcast

Distributed caching is an important aspect of cloud-based applications, whether they run in on-premises, public, or hybrid cloud environments. It facilitates incremental scaling, allowing the cache to grow as the data grows. In this blog we will explore distributed caching in the cloud and why it is useful for environments with high data volume and load.

Lightstep Notebooks helps speed troubleshooting for SREs and developers

Jun 27, 2022 By Ben Sigelman In ServiceNow

Digital business is an imperative for 21st-century companies. Increasingly, organizations are directing investments toward technologies that deliver outcomes fast and enable more resilient digital business models. In this landscape, incidents such as software bugs, power outages, or downed networks have major consequences that affect both revenue and customer loyalty.

How To Prepare for a Site Reliability Engineer (SRE) Interview

Jun 27, 2022 By Stephen Watts In Splunk

Site reliability engineering continues to gain traction in software development and IT. SRE is at the crossroads of software development and IT operations. In Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.” Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems.
