Latest Posts

How Stress Affects Our Learning Abilities in Incidents (And What To Do About It)

Aug 6, 2024 By Sorrel Harriet In Rootly

While retrospectives provide a valuable pathway for learning outside of the flow of work, we also want learning to happen during an incident or unexpected event as it unfolds. This can be challenging due to the negative impact of stress on our ability to learn and navigate difficult situations. In this article, we’ll dig into how stress inhibits our ability to learn and what we can do about it.

The Best SRE Tools To Improve Reliability and Streamline Operations

Jul 31, 2024 By Iryna Iurchenko In Rootly

For better or worse, most companies (including their execs and developers) see SREs as superheroes who'll use their boundless superpowers to save them from the evils of downtime and service degradation. SREs are expected to constantly perform dangerous stunts like production debugging or communicating highly technical issues to angry VPs. They must also be able to manage infrastructure, networks, databases, pipelines, operating systems, and much more.

Beyond MTTR: 7 incident metrics that matter and 3 that don't

Jul 24, 2024 By Ashley Sawatsky In Rootly

Pets.com was an online pet supply retailer founded in 1998, during the dot-com craze. In February 2000, it raised $83 million in its IPO, largely on the strength of metrics like user acquisition, website traffic, and brand recognition. However, its profit margins were minimal and its marketing costs exorbitant, which led Pets.com to file for bankruptcy nine months later. The industry now recognizes these as vanity metrics.

Round Robin escalation policies: do's and don'ts

Jul 9, 2024 By Ashley Sawatsky In Rootly

The concept of Round Robin comes from sports. It has nothing to do with anyone called Robin; the name comes from the French word ruban (ribbon). In a Round Robin tournament, every participant faces every other participant in turn. When applied to on-call schedules, a Round Robin escalation policy means that the responders assigned to a level take turns responding to alerts. When is this strategy useful, and when isn't it?

How Meta and Google use AI to improve incident response

Jul 2, 2024 By JJ Tang In Rootly

The world population in 2024 is approximately 8.12 billion people. Of these, 4.3 billion use Google regularly, while 3.74 billion are active users of Meta's platforms. Any disturbance involving these tech giants is sure to make headlines, as seen in Google's recent UniSuper incident. The scale of these companies brings fascinating challenges to every aspect of their operations, including incident response.

What we can learn from Google's UniSuper incident comms

May 30, 2024 By Ashley Sawatsky In Rootly

Earlier this month, an inadvertent misconfiguration in an internal tool used by Google Cloud resulted in the deletion of a user’s GCVE Private Cloud. The user in question? UniSuper Australia — a $125 billion Australian pension fund with over 600,000 users. In this post, Ashley reflects on the communications shared and what we can learn from them.
