SRE

The latest news and information on Site Reliability Engineering and related technologies.

Sponsored Post

Exploring PagerDuty Alternatives for Incident Response

Incident response is the practice of responding effectively to infrastructure issues and resolving them in the shortest time frame possible. After several costly, high-profile outages over the last few years, organizations have sought to create rigorous processes with specialized tools to resolve incidents quickly and learn from their failures. As one of the first platforms to enter the incident response space, PagerDuty is a dominant player, but over the years competing platforms have begun carving out their own niches.

Sponsored Post

The Importance of Observability for Site Reliability Engineers (SREs)

Site reliability engineers (SREs) play a crucial role in ensuring the reliability of systems. From creating software to improving system reliability in production, responding to incidents, and fixing issues, SREs are responsible for guaranteeing the health of applications. Observability supports SREs in this work, because an observable system allows them to identify and fix issues promptly, leaving SREs better equipped to fast-track development cycles.

Tips to make your Retrospectives Meaningful

If done right, retrospectives can help you inspect past actions, adapt to future requirements, and guide teams towards continuous improvement. However, organizations find it difficult to adopt the right mindset to execute retrospectives effectively. This blog will help you understand what retrospectives are and provide valuable tips to make your retrospectives meaningful.

Introducing Webforms - Involve end users directly into your Incident Management process

Over the years we’ve received requests from our customers for a feature that lets their own customers and end users create or report incidents directly on Squadcast. To our valued customers: we heard you! We are excited to introduce Webforms to do exactly that. In the past, we’ve addressed the challenges pertaining to on-call processes and the best practices that teams can implement.

What's difficult about problem detection? - Three Key Takeaways

Welcome to episode 4 of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.

Managing Squadcast resources with our expanded Terraform provider

Hey folks! We’re excited to announce that we’ve vastly expanded the capabilities of our Terraform provider. Previously, our Terraform provider was limited to creating and managing services as a resource. It now covers the entire spectrum of resources available on Squadcast, from creating and managing users and escalation policies to managing SLOs. What does that mean for you?

Blameless Expands Microsoft Partnership to Deliver Faster, More Intuitive Incident Response Collaboration

The world’s leading software engineering teams rely on Blameless during incident management. A key part of our offering is the ability to seamlessly integrate with a customer’s unique tech stack. As such, we value partnerships with companies like Microsoft that enhance our user experience and meet the needs of our customers. We understand how essential it is to integrate with communication tools like Microsoft Teams, because it’s the first place a user goes to start an incident.

Using Observability with Kubernetes to Automate Site Reliability Engineering

In this video, Anthony Evans, solution architect, explains how the StackState topology-powered observability platform can help SREs to automate site reliability, putting their organizations on the path to becoming a zero-downtime enterprise. See how StackState helps to unify and correlate data across your stack, visualize your entire IT environment, instantly pinpoint root cause, reduce alert storms and, with AIOps capabilities, even prevent problems proactively. It's all here!