Outages ITOps professionals are thankful to avoid
As we settle into the time of year when we reflect on what we’re thankful for, we tend to focus on important basics such as health, family and friends. But on a professional level, IT operations (ITOps) practitioners are thankful to avoid disastrous outages that can cause confusion, frustration, lost revenue and damaged reputations. The very last thing ITOps, network operations center (NOC) or site reliability engineering (SRE) teams want while eating their turkey and enjoying time with family is to get paged about an outage. Outages can be extremely costly: $12,913 per minute, and up to $1.5 million per hour for larger organizations.
To appreciate the peace of mind that comes with avoiding downtime, you have to have endured the pain and anxiety that come with outages firsthand. Here are a handful of the horror stories ITOps pros are thankful to avoid this season.
A case of janky command structure
One longtime IT pro was on a shift with three others as 7 p.m. rolled around. The crew received an alert about a problem impacting the front-end user interface for its global traffic manager device. Thankfully, there was a runbook for it housed in a database, so it appeared the problem would be resolved quickly. One of the team members saw two things to type in: a command and a secondary input. He typed in the command and, based on the way the runbook looked, waited for the command line to ask for an input, such as “what do you want to restart?”
The way the command structure was set up, if you didn’t provide an input, the device itself would restart. He typed in what he thought was the correct command — “bigstart, restart” — and the entire front-end global traffic manager was taken down.
Just as a reminder, this took place in the early evening. The customer was a finance company, and the system went down just around the time when businesses were closing and trying to do their books and other finance-related tasks. Terrible timing, to say the least.
Five minutes into the outage, the ITOps team realized what had happened: the tool they used for their runbook wrapped text by default, so what looked like two separate commands was actually just one. Even though the outage was relatively short, it came at a critical time and created a chain reaction of headaches. The lesson learned? Make sure your command structure, and the way your runbook displays it, leaves no room for misreading, especially when a missing input means restarting the whole device.
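One way to guard against this kind of slip is to check a runbook command before it ever reaches the device. The snippet below is only a minimal sketch of that idea, not the team's actual fix: the function name, the allow-list and the service names in it are hypothetical, and the command syntax is assumed from the story rather than taken from any vendor documentation. It simply refuses a restart that names no target instead of letting the device default to restarting everything.

```python
import shlex

# Hypothetical allow-list: the real service names are not given in the story.
KNOWN_TARGETS = {"frontend-ui", "dns-listener"}


def build_restart_command(runbook_line: str) -> list[str]:
    """Parse a runbook line and refuse a restart that names no target."""
    tokens = shlex.split(runbook_line)
    if len(tokens) < 2 or tokens[1] != "restart":
        raise ValueError(f"not a restart command: {runbook_line!r}")

    targets = tokens[2:]
    if not targets:
        # The case from the story: no input would mean "restart the device".
        raise ValueError("restart needs an explicit target; refusing to run")

    unknown = [t for t in targets if t not in KNOWN_TARGETS]
    if unknown:
        raise ValueError(f"unknown target(s): {unknown}")

    # Return the argument list; actually running it is left to the caller.
    return tokens
```

With a check like this in place, a line such as "bigstart restart" with nothing after it would be rejected before execution, while the same command followed by a known target would pass through unchanged.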