
SRE

The latest news and information on Site Reliability Engineering and related technologies.

SRE Leaders Panel: Managing Systems Complexity

In our previous panel, we spoke about how to overcome imposter syndrome in high-tempo situations, and how culture directly affects the availability of our systems. Building on that discussion, we gathered leading minds in the resilience industry to discuss how SRE can manage systems complexity, and how that is tightly intertwined with business health, especially in the context of the current health and social crises.

Moogsoft Express Helps DevOps and SRE Teams Develop More and Operate Less

“Welcome to Tomorrowland.” That’s how Moogsoft Chairman and CEO Phil Tee kicked off the launch event of Moogsoft Express, the next-generation AIOps and observability solution built from the ground up for DevOps and SRE teams. The reference to a better future is fitting. With its arrival, Moogsoft Express helps these teams maintain visibility and control over increasingly complex CI/CD pipelines, so they can detect issues earlier, fix them faster and prevent outages.

SRE: A Human Approach to Systems

In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given rise to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the scale of ‘unknown unknowns’ increases. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, with customers willing to churn if their needs are not promptly met.

Catchpoint's SRE Report 2020 - The Highlights

Our 2020 SRE Report is ready! We launched the SRE survey 2020 this January with the goal of understanding the current state of SRE. The survey covered a range of topics. As we neared the end of the survey period, the SRE community was in the midst of a sudden change: SRE teams were forced to migrate to all-remote IT. We realized we would not be able to provide an accurate analysis without considering how SRE teams were operating in this new environment.

SRE Leaders Panel: Work as Done vs Work as Imagined

Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed the effects of imposter syndrome, especially during high-tempo situations, how to use it to our advantage and overcome doubt, and how culture directly affects the availability of our systems. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

Web Monitoring Dashboards | The SRE's Ultimate Multi-Tool

It’s 3 AM and you are roused out of sleep by the dull buzzing of your phone in the other room. Some sort of emergency, you conclude as you fumble with the lock screen. There it is: an alert that the API governing user registration is acting up. When we think about the lag between the time of an incident and the time to respond, it’s not just about how long the system was down. How long it physically takes us to respond to the problem also adds to the downtime.