Personal resilience boosts operational resilience
Winter is a grinding time. The temperature, the darkness and the rain all take a toll on people. As a business, it's worth remembering that the people in IT operations need looking after just as much as the technology they maintain. Business leaders can't have one without the other.
Ensuring that engineering teams are stable, happy and able to manage the increasing demands put on them is essential, particularly when the team is held accountable to business-critical KPIs. People aren't the same as systems, and work, life, health, the season and more all have an impact on mindset and performance. Just because the weather may improve or an illness can be overcome, it doesn't mean employees will magically return to their best form right away.
Stability and reliability aren't just about individual well-being. Their presence, or absence, permeates entire teams and organisations, affecting the quality and reliability of the software they build and the systems they maintain. When safeguards aren't in place, even small missteps can escalate into large-scale failures.
Take, for example, the July 2024 incident often described as the world's largest IT outage. A security vendor's development process contained a few small logic flaws, and software was shipped that caused a cascading failure in Windows operating systems. DevSecOps processes should wrap around developers, and the company around the processes, as a series of self-checking defences. But mistakes do happen.
With the pace of technological change increasing and digital transformation intensified by AI, disruptions should no longer be unexpected; they are part of the normal course of business. What separates a swift recovery from prolonged chaos is the level of preparedness in place before the crisis hits. The complexity of IT operations means that incidents and full-blown disruption are always a few small, innocent mistakes away unless teams are well managed and processes are well practised. Personal and operational resilience are directly and tightly bound, and must be understood as a package.
A well-trained and prepared team will significantly shorten the recovery curve, enabling organisations to return to normal business operations much more quickly. An operational crisis can either be an opportunity to build trust and instil confidence, or it can be a catastrophe. The difference lies in preparation, process and the resilience of the people working through the problem.
Resilience is a tree: Plant it years before you need the timber
The best response is not activity A or B from a playbook. It's being ready to respond to any presented risk from its outset.
Start with vigilance. Foster a culture of constant awareness, anticipation and questioning of readiness for different risks. Frequent drills and tabletop exercises keep teams sharp and responsive, ensuring no hesitation when time is critical. If the business believes it can't afford the time and investment for such education and preparation, it must ask itself whether it can afford the fallout from the poor operational outcomes it is likely to face, make a realistic assessment and then adjust business plans accordingly.
Vigilance is not just internal readiness. It requires businesses to maintain a 360-degree view of the risk landscape around them, including major sappers of operational team morale and effectiveness like burnout or 'alert fatigue'. This vigilance extends beyond regularly monitoring systems, updating security protocols and ensuring plans and playbooks are current. It must also involve scanning for new cybersecurity threats, technological advances that could disrupt existing processes and business models, and changes to regulation. It means conducting an honest and egoless review of team management, working practices and interpersonal relationships.
Complexity grows as a business becomes larger and more widely distributed. Seamless operations may mean handing off projects across time zones at the end of the working day. Communication, context and shared systems and tools all become critical in avoiding mishaps.
Invest in cultivating personal resilience among employees. Ongoing training sessions and an active on-call culture are key to helping teams manage the stress and unpredictability that accompany crisis situations. Foster a supportive environment and ensure employees are not merely technically prepared, but mentally resilient as well. The aim is not only to react to incidents, but to build a more resilient team that can serve customers through any type of event.
From a technical standpoint, invest in a platform that will support the operations team in everything they do, from incident management to AIOps that reduces alert noise and accelerates triage. With the right support in the form of reliable automation, teams can accelerate critical work and spend their time on high-value tasks.
Better together
The right tools can help integrate incident learnings into operational playbooks so that teams are better prepared for the next similar incident. Awareness and preparedness increase when learnings from past incidents are put to work in refined crisis strategies. Regularly evaluating crisis response to find ways to respond faster, spot patterns and anticipate future incidents is key, and it relies on understanding both the current cultural and technological business landscape and how past incidents unfolded.
At every stage in the resilience process, it's worth noting that the system will work as intended only if people and technology are supported both as individual 'components' and as complementary parts.