Best practices when managing an outage
There’s never a good time for a service outage. From the moment it hits, it affects your stakeholders: essential daily tasks are curtailed while your team enters emergency response mode. The surest way to limit the damage and recover quickly is to follow a set of best practices.
It’s far better to plan for an outage in advance. If you wait until one happens before you start developing a response, you will be far behind where you need to be for a quick resolution. This guide will help you create a set of best practices for your organization so you can respond faster and more effectively.
Sudden, unexpected outages always increase the stress level of those responsible for fixing the problem. The ability to stay composed in an emergency is a great trait. Some are born with it. However, most must learn to remain relaxed while the world burns down.
A thoroughly developed and well-practiced plan helps you transition into response mode without wasting time panicking.
First, establish an Incident Commander (IC). The IC is the one person in charge of all aspects of response management. In addition, the IC is responsible for breaking down departmental divides during a crisis.
Few things increase the anxiety surrounding an outage as much as interdepartmental bickering. Everyone must understand that different rules are in effect during an emergency, and your team must be on board with the priority of the response plan.
Once your team knows there is a problem, the first decision you must make is when to communicate with stakeholders. For example, hastily sending an emergency outage announcement could cause undue alarm. Alternatively, waiting too long to announce could make customers feel neglected.
Timing is everything. You will need to communicate that your team is aware of the issue and that you understand the size of its impact. Also, convey that your team is working to resolve it quickly. However, you may have to inform them that you are still investigating and determining what caused the problem. This is okay, but be sure to convey that your team thoroughly understands the system and will work rapidly to find and fix the problem.
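The elements of a first announcement described above can be captured in a small template. The following is a minimal sketch, not a prescribed format: the class name OutageNotice, its fields, and the wording of each sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageNotice:
    """Hypothetical template for a first stakeholder announcement."""
    service: str                 # what is affected
    impact: str                  # who or what is affected, in plain language
    cause: Optional[str] = None  # None while the investigation is ongoing

    def render(self) -> str:
        # Always state awareness, impact, and that work is under way.
        lines = [
            f"We are aware of an issue affecting {self.service}.",
            f"Impact: {self.impact}",
            "Our team is actively working to resolve it.",
        ]
        # Be honest when the cause is not yet known.
        if self.cause is None:
            lines.append("We are still investigating the cause and will share updates as we learn more.")
        else:
            lines.append(f"Cause identified: {self.cause}")
        return "\n".join(lines)


notice = OutageNotice("checkout", "Customers cannot complete purchases.")
print(notice.render())
```

Leaving the cause empty produces the "still investigating" wording, so the same template works for both the first notice and later updates.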
It is best to have your communications channels pre-established for internal and external stakeholders. Use text-based, channel-oriented communication methods like Slack or Microsoft Teams for your team. These platforms will provide a typed record of all internal interactions during the outage response.
You can use set-aside channels on an open platform like Twitter for external communications. This will help ensure that interested stakeholders won’t have barriers to deal with as they keep up with the latest developments.
Dedicating each channel to a specific incident topic keeps the information in it pertinent, regardless of who joins the conversation.
Find the source of the outage.
Outages can occur for many reasons. However, they typically fall into two categories: internal and external.
On the internal side, programming errors in internal apps or databases, network hardware issues, or network misconfigurations may be the root of the problem. Understanding internally caused outages requires you and your team to understand thoroughly how your system operates. In addition, it requires continual training and practice exercises to keep team members up-to-date and confident.
More commonly these days, the source of trouble is external. For example, a cloud service your system depends on could be down, causing your incident. In this case, a Service Level Agreement (SLA) with the provider will likely cover the outage, leaving the fix in the provider’s hands.
However, to invoke the SLA, you must be able to identify the external plug-in or app that caused the outage. In the past, this process involved going to each app’s status page and checking for outages that might affect your network. Depending on the number of external apps you use, this could be a rather extensive procedure. The biggest drawback was the time you lost trying to find the initial source of the problem, much of it spent sifting through pages unrelated to your situation.
IsDown changes all that. The IsDown monitoring dashboard consolidates all your critical services status pages in one easily accessible place.
Even better, you receive instant alerts when one of your cloud services experiences an issue that potentially affects you. As a result, you can act quickly and confidently when you know what caused the outage. It’s possible to solve the problem before most stakeholders even realize an issue exists.
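To illustrate the idea of consolidating status checks in one place, here is a minimal sketch. It is not IsDown’s API: the function find_degraded, the provider names, and the convention that each check returns True when a provider is healthy are all assumptions made for this example. In practice each check would wrap whatever status source a provider actually offers.

```python
from typing import Callable, Dict, List

# A status check returns True when the provider reports itself healthy.
StatusCheck = Callable[[], bool]


def find_degraded(checks: Dict[str, StatusCheck]) -> List[str]:
    """Run every check and return the names of providers that look unhealthy."""
    degraded = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that cannot complete is treated as a possible problem.
            healthy = False
        if not healthy:
            degraded.append(name)
    return sorted(degraded)


def unreachable() -> bool:
    # Stand-in for a status source that cannot be reached.
    raise TimeoutError("status page unreachable")


# Hypothetical providers with stubbed checks.
checks = {
    "cloud-host": lambda: True,
    "payments": lambda: False,
    "email": unreachable,
}
print(find_degraded(checks))
```

Treating an unreachable status source as degraded is a deliberate choice here: during an incident, a false alarm costs less than a missed one.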
Escalate if needed.
If the outage will significantly impact your stakeholders, it is time to elevate the incident’s priority level. You will need to assign the right staff based on the specifics of the incident. Thinking about required roles instead of individuals with specific job titles is helpful.
In other words, assign the most qualified and available staff member as IC. Depending on the size of your organization and the number of affected stakeholders, you may also need to designate a communications leader to help the IC coordinate the timing and tone of internal and external messages.
Additionally, the response team may need to include one or more subject matter experts. These are typically engineers whose sole focus should be an immediate fix for the technical problems causing the outage.
If someone is available, appointing a team member as the documentarian for the incident can be helpful. This person records everything that happens on a timeline, from the first moment of the outage to its end.
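The documentarian’s timeline can be as simple as a list of timestamped notes. The sketch below shows one possible shape for it; the names IncidentTimeline, TimelineEntry, record, and report are hypothetical, and timestamps are assumed to be timezone-aware.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # timezone-aware time of the event
    author: str    # who reported or performed the action
    note: str      # what happened


@dataclass
class IncidentTimeline:
    """Hypothetical incident log kept by the documentarian."""
    title: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str, at: Optional[datetime] = None) -> None:
        # Default to the current UTC time when no timestamp is supplied.
        self.entries.append(TimelineEntry(at or datetime.now(timezone.utc), author, note))

    def report(self) -> str:
        # Entries may be added out of order, so sort by time before printing.
        ordered = sorted(self.entries, key=lambda e: e.at)
        lines = [self.title]
        for e in ordered:
            stamp = e.at.astimezone(timezone.utc)
            lines.append(f"{stamp:%Y-%m-%d %H:%M} UTC  {e.author}: {e.note}")
        return "\n".join(lines)
```

A timeline kept this way can feed directly into the post-incident review, since it already lists who did what and when.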
Update stakeholders continually.
Keeping customers updated is vital to a successful response. Make sure they have access to the appropriate chat channel for real-time status reports. Also, keep your webpage chronicling the incident current and accessible. Finally, check all your communications to ensure they express the right tone of urgency with plenty of confidence that the problem will be short-lived.
Don’t skip the post-mortem.
Once the incident is resolved, a forensic look at what happened is invaluable. You can better understand how your system functions by reviewing what took place, what worked, and what didn’t. In addition, a thoroughly conducted post-mortem will provide the basis for updating your response plan for the next time.
Looking back will help you perfect your future responses and better care for your customers and stakeholders.
And remember, you can quickly identify external outages that impact your business with IsDown. Get started for free and find the missing layer in your monitoring stack.