Olly for SREs: 3 ways I actually use it in production
There’s a moment after an alert where you’re not fixing anything yet. You’re trying to answer a much simpler question: Is it actually down? Sometimes it’s obvious. Sometimes it’s 20 alerts at once with no clear starting point. Sometimes it’s a small upstream degradation that might cascade. Sometimes it’s just a spike that resolves on its own. That first phase is orientation. Is the signal real or transient? Is it isolated or spreading? Root cause or symptom?