|
By Sorrel Harriet
While retrospectives provide a valuable pathway for learning outside of the flow of work, we also want learning to happen during an incident or unexpected event as it unfolds. This can be challenging due to the negative impact of stress on our ability to learn and navigate difficult situations. In this article, we’ll dig into how stress inhibits our ability to learn and what we can do about it.
|
By Iryna Iurchenko
For better or worse, most companies—including their execs and developers—see SREs as superheroes who’ll save them from the evils of downtime and service degradation with their boundless superpowers. SREs are expected to constantly perform dangerous stunts like production debugging or communicating highly technical issues to angry VPs. They must also be able to manage infrastructure, networks, databases, pipelines, operating systems and much more.
|
By Ashley Sawatsky
Pets.com was an online pet supply retailer founded in 1998, during the dot-com craze. In February 2000, it raised $83 million to go public based mainly on metrics like user acquisition, website traffic, and brand recognition. However, the profit margins were minimal and the marketing costs exorbitant, which led Pets.com to file for bankruptcy nine months after its IPO. The industry now recognizes these metrics as vanity metrics.
|
By Ashley Sawatsky
The concept of Round Robin comes from sports. It has nothing to do with anyone called Robin; the name derives from the French word ruban (ribbon). In a Round Robin tournament, all participants face each other by taking turns. When applied to on-call schedules, a Round Robin escalation policy means that responders assigned to a level will take turns responding to alerts. When is this strategy useful, and when isn’t it?
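As a rough illustration of responders taking turns, here is a minimal Python sketch. It assumes a single escalation level with a fixed list of responders; the function name, responder names, and alert IDs are hypothetical and do not reflect how any particular on-call tool implements the policy.

```python
from itertools import cycle


def round_robin_assignments(responders, alerts):
    """Pair each alert with the next responder in turn, wrapping around."""
    if not responders:
        raise ValueError("at least one responder is required")
    rotation = cycle(responders)  # endlessly repeats the responder list in order
    return [(alert, next(rotation)) for alert in alerts]


# Hypothetical responders and alerts, for illustration only.
responders = ["ana", "ben", "chen"]
alerts = ["alert-1", "alert-2", "alert-3", "alert-4"]

assignments = round_robin_assignments(responders, alerts)
for alert, responder in assignments:
    print(f"{alert} -> {responder}")
```

With three responders and four alerts, the fourth alert wraps back to the first responder, which is the turn-taking behavior described above.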
|
By JJ Tang
The world population in 2024 is approximately 8.12 billion people. Of these, 4.3 billion people use Google regularly, while 3.74 billion are active users on Meta's platforms. Any disturbance involving these tech giants will surely make headlines, as seen in Google’s recent UniSuper incident. The scale of these tech companies brings fascinating challenges in every aspect of their operations, including incident response.
|
By Ashley Sawatsky
Earlier this month, an inadvertent misconfiguration in an internal tool used by Google Cloud resulted in the deletion of a user’s GCVE Private Cloud. The user in question? UniSuper Australia — a $125 billion Australian pension fund with over 600,000 users. In this post, Ashley reflects on the communications shared and what we can learn from them.
|
By Jorge Lainfiesta
Use the different timezones and varied needs of your team to schedule on-call rotations that make everyone happy.
|
By Jorge Lainfiesta
No matter how good a new teammate is, a lot of their success is in your hands.
|
By Jorge Lainfiesta
Discover 5 models of compensation for on-call.
|
By Jorge Lainfiesta
What's the secret to achieving the right balance for your platform?
Rootly is a turnkey incident response command centre that brings the best reliability practices from Google, Netflix, and Amazon to those without a million-dollar budget.
Rootly is an all-in-one platform that streamlines collaboration, communication, and learning. It automates away manual toil engineers suffer through today and captures data-driven insights. With Rootly, companies accelerate their incident resolution and learn how to prevent them in the future.
Teams depend on Rootly to improve their reliability:
- Collaborate: Seamlessly hand off alerts from PagerDuty and quickly declare incidents from your tool of choice, like Slack. Automatically involve all the right teams in seconds, not minutes. Go beyond engineering and loop in legal, support, and sales. With intelligent workflows, there is no more wondering which team owns which service or who should be responsible for what. Rootly does the heavy lifting for you.
- Communicate: Build your incident timeline through Web or Slack. Autolink war rooms with our Zoom & Google Meet integrations. Rich and customizable private and public status pages ensure everyone is updated while you focus on what you do best, fighting fires.
- Remediate: Enrich your timeline with automated Genius workflows. Fetch relevant information, such as recent Git commits to your impacted services. Customize your workflows based on any incident condition.
- Retrospective: Learn from incidents with beautiful postmortems that engineers want to write, without the manual toil of copying and pasting. Accurately replay past incidents to simulate real-world disaster scenarios, train engineers faster, and keep their tools sharp. Postmortems stay organized and easily shared, not buried in a Google Doc that can’t be found.
All-in-one incident response platform for humans.