Latest Posts

Top Industry Performers in Unplanned Server Downtime | Q1 The Uptime Report

Can you be incompetent and still stay in business? Not as far as your web infrastructure is concerned. All the studies show that when a website is unavailable, or even just slow to load, customers go elsewhere—and often they don’t come back. After all, if you can’t keep a website up and running, why should people trust you to deliver any other product or service? So it’s worth asking: how reliable is your website relative to the top brands in your industry?

April 2020 Outage Report

We will always remember April 2020 as the month that a DDoS attack took the auction for the world’s most expensive bottle of whiskey offline. We barely knew ye. But other notable outages taught us a lot about which threats dominate our landscape: namely DDoS attacks, which are highlighting the vulnerabilities organizations have in redundancy and threat mitigation.

How to Use Uptime Monitoring Tools to Check Website Uptime

One of the central questions we ponder in our work is: what does uptime mean in an interconnected world? You can do everything to ensure 100% reliability, yet still fail. How is this possible? Shouldn’t there be enough redundancy to ensure nothing breaks if you don’t actively break it? That’s another way of saying technology is great when it works.

What Does 99.9% Uptime Mean?

An old adage about choosing a hosting provider says that everyone promises 99.9% uptime, so you need to test a site’s uptime yourself to get the real picture. Or scour the forums for reviews and judge for yourself how reliable the provider is. That works too. What that saying is really getting at is the need for some kind of indicator that uptime does not fall below expectations, because you can’t just trust the word of the provider when your business is at stake.
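To make the 99.9% figure concrete, here is a minimal sketch that converts an uptime percentage into the downtime it permits over a given period. The function name `allowed_downtime` and the 30-day and 365-day periods are illustrative assumptions, not part of any provider's SLA definition.

```python
from datetime import timedelta


def allowed_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted by an uptime percentage over a period.

    Hypothetical helper for illustration: real SLAs may exclude
    scheduled maintenance or measure availability differently.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The permitted downtime is the fraction of the period NOT covered
    # by the uptime promise.
    return period * ((100 - uptime_percent) / 100)


if __name__ == "__main__":
    for label, period in [("30 days", timedelta(days=30)),
                          ("365 days", timedelta(days=365))]:
        budget = allowed_downtime(99.9, period)
        minutes = budget.total_seconds() / 60
        print(f"99.9% uptime over {label}: about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% promise still leaves room for roughly 43 minutes of downtime in a 30-day month and close to nine hours over a year, which is why independent measurement matters.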

March 2020 Outage Report

It’s pretty safe to say that March was the month where everything changed for most of us. By now, enough has been said on coronavirus and we need not add to the pile. Our concern remains continuous uptime, and reporting on outages as teachable moments. During this time of heightened tensions, let’s take a few moments to do some post mortem work and see what we can learn from March’s outages.

The Uptime.com Report for 2019

Unplanned downtime can drive significant losses in the form of unrealized revenue. Teams may be caught off guard, or may face an outage outside their control, extending downtime hours unnecessarily. Without automated monitoring and alerting, teams face undetected outages that silently threaten SLA fulfillment. The recommendations in this report are best used as a guide on what trends may drive Site Reliability Engineering in the near term.

IT and DevOps Resources for COVID-19

We’re all wrestling with less-than-ideal circumstances during the COVID-19 pandemic. Whether you’re sheltering in place or simply practicing social distancing, it’s safe to say we’re all adjusting to a temporary new normal. One commonality is the need for connectivity. If infrastructure fails, business will screech to a halt and we will find ourselves in a new kind of mess altogether.

What Makes SSL Fail, and What Can SREs Do About It?

The TLS protocol and its predecessor, SSL, make the web go round. They are fundamental to establishing a link between two computers, creating a very special mathematical relationship signified by the all-encompassing gesture of friendship: the handshake. So fundamental, in fact, that we probably take them for granted when we shouldn’t. Users rely on TLS encryption every day to protect their data and the integrity of their sessions.

February 2020 Downtime Report

February kicked 2020 off with a terrifying glimpse into what happens when the Internet of Things stops Internetting things. If we consider our central question this year of uptime in the age of always-connected, then we start to see the impact of hidden failures. All the stuff we don’t know we know impacts the end-user. Someone forgets to renew a TLS certificate, and half the business world can’t collaborate. Someone else flubs an update?

How to Improve Downtime Response: Error Budgeting and Unplanned Downtime

Every one of us reading this blog has seen a fire spring up and quietly walked away from the impending chaos. And every one of us has managed to live this long because we understand when to react to a fire. A real fire affects our Service Level Objectives (SLOs) and affects the user base. You need to figure out where it is, what started it, and what your team will do about it, and you need to do that now.
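One way to decide whether a fire is "real" is to check how much of the error budget it has burned. The sketch below assumes a request-based SLO (a target fraction of successful requests); the function name `error_budget_remaining` and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is exactly spent,
    and a negative value means the SLO has been breached.
    Assumes a request-based SLO, e.g. slo_target=0.999 for 99.9%.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Illustrative numbers: 1,000,000 requests against a 99.9% SLO
    # tolerate about 1,000 failures; 250 failures spend a quarter of that.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

A team might treat a rapidly shrinking value as the signal to respond immediately, and a small, steady burn as something to handle in normal working hours.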