uptime

IT and DevOps Resources for COVID-19

We’re all wrestling with less-than-ideal circumstances during the COVID-19 pandemic. Whether you’re sheltering in place or simply practicing social distancing, it’s safe to say we’re all adjusting to a temporary new normal. One commonality is the need for connectivity. If infrastructure fails, business will screech to a halt and we will find ourselves in a new kind of mess altogether.

What Makes SSL Fail, and What Can SREs Do About It?

The TLS protocol (and its predecessor, SSL) makes the web go round. It is fundamental when establishing a link between two computers, creating a very special mathematical relationship signified by the all-encompassing gesture of friendship: the handshake. So fundamental, in fact, that we probably take it for granted when we shouldn’t. Users rely on TLS encryption every day to protect data and the integrity of a session.

February 2020 Downtime Report

February kicked 2020 off with a terrifying glimpse into what happens when the Internet of Things stops Internetting things. If we consider our central question this year of uptime in the age of always-connected, then we start to see the impact of hidden failures. All the stuff we don’t know we know impacts the end user. Someone forgets to renew a TLS certificate, and half the business world can’t collaborate. Someone else flubs an update?
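A forgotten certificate renewal is one of the easier hidden failures to watch for. As a minimal sketch, assuming the certificate's expiry is available as a `notAfter` string in the format Python's `ssl` module reports, the remaining lifetime can be computed like this. The function name `days_until_expiry` and the 14-day warning threshold are illustrative assumptions, not something taken from the report.

```python
import ssl
from datetime import datetime, timezone

# Hypothetical threshold: warn when fewer than this many days remain.
WARN_DAYS = 14


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days left before a certificate's notAfter time.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Mar  2 00:00:00 2020 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(expires_epoch, tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime) -> bool:
    """True when the certificate is expired or inside the warning window."""
    return days_until_expiry(not_after, now) < WARN_DAYS
```

In practice the `notAfter` value would come from a live TLS connection to the monitored host; it is passed in as a plain string here so the sketch runs without network access.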

How to Improve Downtime Response: Error Budgeting and Unplanned Downtime

Every one of us reading this blog has seen a fire spring up and quietly walked away from the impending chaos. And every one of us has managed to live this long because we understand when to react to a fire. A real fire affects our Service Level Objectives (SLOs) and the user base. You need to figure out where it is, what started it, and what your team will do about it, and you need to do that now.
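One way to judge whether a fire is "real" is to measure it against an error budget. The sketch below shows the usual arithmetic, assuming an availability SLO expressed as a percentage over a fixed window. The function names and the 30-day default window are illustrative assumptions rather than anything prescribed by the post.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left after the downtime already incurred.

    A negative result means the SLO has been missed for this window.
    """
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(99.9)
```

An incident that burns a large share of the remaining budget deserves an immediate response; one that barely registers against it can often wait.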

Why Your Status Page Matters and How to Use It

When an outage hits your service, everybody starts talking. Your engineers are talking about what caused the problem, and how to fix it; your management is asking about when it’ll be fixed; and your customers are telling the world that they’re not happy. But there’s an even more important conversation you should be having: communicating with your users about the issue.

January 2020 Outage Report

Welcome to 2020, where Google Drive can fail for some of you but not others, you can’t access your passwords, and you can’t withdraw cash on vacation. This stranded-on-a-desert-isle dream was reality in the month of January, which saw drama in the financial services and internet infrastructure sectors. January’s downtime reinforces just how connected we have become, and how reliant we are on infrastructure that can seemingly fail on a whim.

Transaction Monitoring | Upgrades and Use Cases in 2020

Synthetic monitoring takes care of all of the small interactions on our website that QA can’t catch. If you’re building an application for the web, a transaction check is an integral part of proactive downtime resolution. What we call transaction monitoring, or a transaction check, is a set of instructions that a probe server follows.
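The idea of a transaction check as "a set of instructions that a probe server follows" can be sketched as an ordered list of steps that stops at the first failure. This is a simplified illustration, not Uptime.com's implementation: the `Step` type, the `run_transaction` runner, and the in-memory `session` dictionary standing in for a real browser are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    # The action receives shared session state and returns True on success.
    action: Callable[[Dict[str, str]], bool]


def run_transaction(steps: List[Step], session: Dict[str, str]) -> Optional[str]:
    """Run steps in order; return the first failing step's name, or None."""
    for step in steps:
        try:
            ok = step.action(session)
        except Exception:
            ok = False  # An exception counts as a failed step.
        if not ok:
            return step.name
    return None


# Hypothetical login flow against an in-memory stand-in for a website.
def open_login(session: Dict[str, str]) -> bool:
    session["page"] = "login"
    return True


def submit_credentials(session: Dict[str, str]) -> bool:
    if session.get("page") != "login":
        return False
    session["page"] = "dashboard"
    return True


def dashboard_loaded(session: Dict[str, str]) -> bool:
    return session.get("page") == "dashboard"


LOGIN_CHECK = [
    Step("open login page", open_login),
    Step("submit credentials", submit_credentials),
    Step("verify dashboard", dashboard_loaded),
]
```

Reporting the name of the failing step, rather than a bare pass or fail, tells the on-call engineer where in the user journey the break occurred.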

Got Game? Secrets of Great Incident Management

When his phone wakes him at two in the morning, operations engineer Andy Pearson knows it’s bad news. There’s a major server problem, and hundreds of client websites are down. Automated monitoring checks detected the outage within seconds and paged the on-call engineer. This time, it’s Pearson in the hot seat. Pearson quickly confirms the issue is real and escalates it to his boss, tech lead Lewis Carey.

December 2019 Outage Report

December was a busy month with systems we don’t normally see experiencing server downtime. Our first story, in particular, is an excellent example of how complicated monitoring can get as infrastructure grows. We saw every level get hit, from the government to big-name players, with ransomware being one of the major thorns in our collective sides. But we also bring you the heartwarming tale of the little Minecraft Server that could.

What We Learned About Uptime from 2019 Website Outages

One thing we’ve always known: there’s no such thing as 100% uptime for any website. Too many variables are at play to keep a site up all the time. From traffic surges to hardware failures and everything in between, keeping sites up and running is a full-time job for SREs and IT pros. Here at Uptime.com, we track major downtime all year long to provide websites of all sizes with lessons in how to catch downtime and resolve incidents quickly.