We explain what infrastructure monitoring is, how it works, how to overcome the challenges in complex systems, best practices for monitoring, and the tools you need.
As users expect incidents and outages to be addressed as quickly as possible, at any time of day, on-call rotations have become necessary for SRE and DevOps teams. How do you create on-call rotation schedules that are fair and reduce burnout?
Welcome to episode 4 of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.
The world’s leading software engineering teams rely on Blameless during incident management. A key part of our offering is the ability to integrate seamlessly with a customer’s unique tech stack. As such, we value partnerships with companies like Microsoft that enhance our user experience and meet the needs of our customers. We understand how essential it is to integrate with communication tools like Microsoft Teams, because that is the first place a user goes to start an incident.
Incidents happen, so how do you handle them? We explain incident management, how to prioritize incidents, and the process for resolving them.
Software metrics give important insight into the performance of your product, but which ones matter most to SRE teams? How do you decide which metrics to track?