Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

iGaming: Where Incident Management Meets Compliance

At a time when players have multiple online choices and competition is fierce, safe betting and social responsibility are at the forefront of brand integrity. In fact, social responsibility has become a competitive edge for leading operators. Enter the era of the regulator. Regulation is now defining both the operator’s brand integrity and the player experience. Are online operators up to the regulatory task? Some are, and some are not.

Resilience in Action, E5: Tammy Bryant and Eric Roberts - The Importance of Glue Work

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Blameless Staff SRE Amy Tobey. Amy has been an SRE and DevOps practitioner since before those names existed. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others.

Humanizing a DevOps Transformation

Anyone who’s ever played chess knows there’s more than one way to reach a desired outcome. There are 400 possible setups after the first turn; 197,742 after the second; and just north of 120 million after the third, all of which are marching toward the same desired outcome. “So, what does any of this have to do with DevOps?” you ask. Fair question.

Effective Communication Between Healthcare Professionals - Best Practices

Effective communication between healthcare professionals is critical for timely and effective operations. In a modern healthcare environment, communication technologies are essential for connecting healthcare professionals with other caretakers and healthcare entities, ensuring the best, most effective, and most immediate care for patients.

Choosing the Right SRE Tools

Implementing SRE practices and culture can be challenging. Fortunately, there are a variety of tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more. In this blog, we’ll talk about what to look for in SRE tools and how they’ll help you on your journey to reliability excellence.

I Have an SLO. Now What? - Alex Hidalgo

It’s 2020: There is a plethora of data available about measuring SLIs and setting SLO targets. But now that you have this data, what are you actually supposed to do with it? The classic example, “Ship features when you have error budget; focus on reliability when you don’t,” is antiquated, too simple, and ignores all of the amazing discussions and decisions you can have with your SLO data. Let’s talk about how you can use SLOs to actually make people happier, from your customers to your engineers to your business.

Look Upstream to Solve your Team's Reliability Issues

In “Upstream” by Dan Heath, we explore a variety of problems ranging from homelessness to high school graduation rates to the state of sidewalks in different neighborhoods within the same city. In each of these examples, Dan discusses how upstream thinking decreased downstream work. Upstream thinking is characterized as proactive, collective action to improve outcomes, rather than reaction after an issue has already occurred.

Keeping your teams and customers in the loop during downtime

Making your organization more transparent is not always an easy process. In our latest blog post, Adam Hammond shares some tips and tools that can help you get started when it comes to keeping your teams and customers in the loop during downtime. The core message is that you need to make communication a cultural pillar of your organization.