The latest news and information on incident management, on-call, incident response, and related technologies.
According to ITIL, the framework of best practices for delivering IT services, there is a recommended process flow for handling major incidents. The IT community would be well served to follow ITIL’s systematic and professional approach, whose benefits, according to CIO Magazine, ...
The typical DevOps on-call engineer responds to alerts, triages them by service impact, troubleshoots high-priority incidents, and takes action to remediate issues. Automation tools like AWS Systems Manager can reduce much of the repetitive work and let engineers focus on the most important tasks.
In the IT world, if a server can fail or traffic can overload the network, it will. And the consequences of downtime are significant. Many IT organizations face database, hardware, and software downtime that may last only a short time or may shut down the business for days. According to Gartner, the average cost of network downtime alone is $5,600 per minute. What measures can organizations take to reduce IT downtime?
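To put the Gartner figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the quoted $5,600 per minute average applies uniformly for the whole outage, which is a simplification, since real costs vary widely by organization and time of day. The function name `downtime_cost` is hypothetical and used only for illustration.

```python
# Rough downtime cost estimate based on the Gartner average quoted above.
# Assumption: the per-minute cost is constant for the duration of the outage.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Illustrative outage durations, expressed in minutes.
    for label, minutes in [("5 minutes", 5), ("1 hour", 60), ("1 day", 24 * 60)]:
        print(f"{label}: ${downtime_cost(minutes):,.0f}")
```

At that average rate, a one-hour outage works out to $336,000 and a full day to roughly $8 million, which is why even small reductions in downtime matter.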
Major incidents are inevitable, and fixing them is the top priority for any ops or DevOps team. But what happens after service is restored? Do teams take the time to fully understand what went wrong, then follow up to prevent it happening again?