SRE

The latest news and information on Site Reliability Engineering and related technologies.

What's the Difference Between an Agile Retrospective and an Incident Retrospective?

Blameless Chief Operating Officer Ken Gavranovic recently sat down with Lee Atchison, a renowned expert in system reliability, to discuss the topic of conducting effective incident retrospectives. You can watch their engaging, informative discussion below, or read on for our overview of the greatest hits from their talk. Agile development and incident management are the backbones of any tech-driven development cycle. At the heart of these practices lies the art of retrospectives.

Elastic AI Assistant for Observability

Harness the power of generative AI to turn insights into actions. Powered by the Elasticsearch Relevance Engine™ (ESRE™), Elastic’s AI Assistant (in technical preview for Observability) transforms problem identification and resolution by replacing manual data chasing across silos with an interactive assistant that delivers accurate and context-aware remediation for SREs.

Seven Models of Cloud Native Applications

In today's cloud-driven landscape, organizations are transitioning from legacy monolithic systems to agile, scalable, and secure cloud-native solutions. Some are even building new cloud-native applications from scratch. However, the concept of cloud-native design remains subjective, lacking a universal blueprint. This blog aims to provide clarity and guidance for designing cloud-native applications and container deployments.

How to Set Up an IT War Room

IT issues can happen at any time and significantly impact an organization, so it's essential to have a plan to handle them quickly and efficiently. One way to do this is to create an IT war room: a dedicated space for teams to collaborate and resolve issues. Establishing an IT war room enhances an organization's capacity to address IT problems swiftly, ultimately reducing their impact on the business.

Enhancing Incident Management: Seven Integrations to Complete Your Ticketing Systems

Squadcast offers powerful integrations that simplify Incident Management processes and complete your ticketing systems, ensuring seamless collaboration and timely issue resolution.

Practical guidance for getting started as a site reliability engineer

At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move. With only some high-level knowledge of what the company and its systems looked like prior to this point, it’s fair to say that I didn’t have much certainty about what exactly I’d be working on or how I’d deliver it.