Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

How To Write Incident Postmortems

Writing a public postmortem regarding an outage is essential to maintaining transparency and accountability when things go wrong in a service or system. The purpose of writing a postmortem is to analyze and document an incident or event that has occurred, usually with a focus on identifying its root causes, understanding what went wrong, and outlining steps to prevent similar issues from happening in the future.

Rootly Raises $12 Million from Renegade Partners, Google Gradient Ventures, & XYZ Ventures

We are excited to announce that we have raised a $12M round of financing led by Renegade Partners with participation from Google Gradient Ventures (Google’s AI-focused venture fund) and XYZ Ventures. This brings our total funding to date to $15.2M ($20M CAD) alongside our other existing investors Y Combinator and 8VC.

SRE in Transition: From Startup to Enterprise

"Startups are defined by “ship or die”. As a result, SRE teams at a startup should be focused on enabling product engineers to ship features as quickly as possible. As your startup transitions from “we’ll run out of money in the next 18 months” to “we have more than 1000 engineers”, how should the SRE organization evolve and provide the best value through that transition (including booting one up if you don’t have one)? I will discuss specific ways the organization needs to evolve to meet this challenge, how the SRE org can advocate for and support this change (both in direct actions and in “influence”), and how the overhang of startup technical and cultural debt can make this shift more challenging (but also more necessary).

Tools and Trends in Site Reliability Engineering according to Gartner's 2023 Hype Cycle

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023, report. This blog reviews the future of site reliability engineering based on Gartner’s Hype Cycle. Additionally, the OnPage team is pleased that Gartner mentioned OnPage as a sample vendor in the Automated Incident Response category.

Evolution of Site Reliability - Incidentally Reliable with Manoj Sebastian

Catch Manoj Sebastian(ex-Flipkart, Amazon, Atlassian, Intuit, Yahoo) talk about The Evolution of SRE through 20 years, Incident Response and Post Incident Culture at Big Tech and the Future of Reliability with AI ramping up at full speed. The freshest podcast for Site Reliability Engineers, hosted by Vishwa and Shubham from Zenduty.

Unveiling Squadcast's Enhanced Status Pages

Meet Kevin and Mai (again): Navigating the Troublesome Waters of Platform Downtime. Kevin is a Site Reliability Engineer (SRE), constantly on the lookout for potential downtime that could impact their platform, kryptobro.com. Mai is his adept partner, ever-ready to troubleshoot. In their journey, the previous version of Squadcast Status Pages served as a helpful tool, but they soon found room for improvements.

SRE Redefines IT Operations as Architect of Sustainable Systems

Site Reliability Engineering (SRE) is a term that’s getting attention and gaining momentum – and for a good reason. SRE takes features of software engineering and applies them to various problems in infrastructures and operations. Organizations look to build SRE teams with a couple goals in mind, including to create and increase scalability and develop solid software systems.