SRE

The latest News and Information on Service Reliability Engineering and related technologies.

SLOs with Prometheus done wrong, wrong, wrong, wrong, then right

We have Carson Anderson, Sr. DevOps Engineer at Weave HQ, talking about how they implemented SLOs using Prometheus, what went wrong, and how they fixed it. This talk was given to the "Last9 of Reliability" Discord community on 13th December. Talk Description: First things first: Yes, it really did take us 5 tries to implement our SLOs with Prometheus. While that may seem embarrassing, we are very happy to share our SLO journey so that we can hopefully help you avoid the same mistakes.

Introducing Squadcast's Intelligent Alert Grouping and Snooze Notifications

Maintaining system reliability amidst a deluge of alerts remains a formidable challenge for complex infrastructure environments. To address this need, Squadcast is happy to introduce Intelligent Alert Grouping, developed from in-depth discussions and feedback from our enterprise customers. It is designed to streamline Incident Management so that Incident Response teams can focus on what truly matters.

How Squadcast's Workflows Enhance Incident Management Automation?

One of the daily challenges for Incident Response teams is the pressure to resolve incidents swiftly and effectively. However, manual processes often hinder this objective, leading to delays, oversights, and potential miscommunication. In this blog, we’ll look at the practical aspects of workflow automation in Incident Management using Squadcast, exploring how it streamlines processes, eliminates manual tasks, and enhances overall efficiency.

How to Calculate and Minimize Downtime Costs

Downtime is an unwelcome reality. Beyond the immediate disruption, outages carry a significant financial burden, impacting revenue, customer satisfaction, and brand reputation. For SREs and IT professionals, understanding the cost of downtime is crucial to mitigating its impact and building a more resilient infrastructure.

Sponsored Post

Runbook vs Playbook: What's the difference?

What's the difference between a Runbook and a Playbook? Once and for all, we'll end this confusion today. If you find yourself worrying about forgetting the detailed process of the incident your team just resolved, you're not alone. This is where documentation like Runbooks and Playbooks comes into play. Runbooks and Playbooks serve as organizational guides, providing essential information and instructions for teams to navigate tasks and processes effectively. They not only help your team help themselves but also free up your time for your ever-growing to-do list.

2023 Rewind: Squadcast Year-End Review

Hold the confetti, everyone, because it's time to POP the champagne! 2023 was a year where Squadcast truly leveled up. We dropped some remarkable features that got our hearts racing (and alerts under control!), snagged some fantastic recognition for our impact, and even gave our website a stunning makeover. And we couldn't have done it without you! Buckle up for a rewind of it all. Let's get started.