Don't measure reliability with a lagging indicator like downtime or MTTR
Your reliability measurement can't just be a lagging indicator.
"How do you know your company is doing well at reliability?
A lot of people will just look at how many outages have you had in the last year and how much customer pain have you caused.
I think that's one side of the coin. That's the reactive lagging indicator of the health of our system. To really be good at this, we need a way to understand the risks and the sharp points so that we have an idea of what we're getting into.
And ultimately show that not just are the lagging indicators going down, but the leading indicators are going down, which gives us confidence that we are in control of the problem and we are making forward progress.
We're not just reacting and we didn't just get lucky."
—Kolton Andrus, Gremlin CTO