AIOps myths and how to avoid them
Gartner coined the term AIOps in 2016 to refer to the combining of “big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.” In the five years since, AIOps has grown by leaps and bounds; last year, it was at the peak of the Gartner hype cycle.
Multi-cloud and hybrid cloud environments, microservices architectures, the rapid growth in the number of mission-critical applications, and the sudden surge in remote work have made enterprise networks far more complex. These networks were often not designed to handle the variety of traffic, across physical and wireless media, that has become common today: for instance, the volume of video calls and the data transferred through screen sharing.
If 2020 was a year of turbulence, 2021 was the year of complete digital transformation. Enterprises across the globe focused their efforts on enabling stellar digital experiences for customers and internal stakeholders alike. This had a significant impact on the IT landscape. The number of applications that were ‘mission-critical’ increased overnight. In a recent survey, respondents said that they have an average of 71.4 mission-critical applications.
In 2017, McAfee found that the average enterprise uses 464 custom applications. A large enterprise (a company with over 50,000 employees) uses 788 custom apps! The more applications you have, the more complex your application environment is, and the more susceptible you are to outages. Yet the tolerance for downtime is extremely low: mission-critical applications must be available at all times.
At 8:54 pm on November 1, 2020, a customer of HDFC Bank complained on Twitter that the bank’s services, such as internet banking and ATMs, were down. More customers raised similar issues over the next couple of hours, saying that UPI, credit card, and debit card transactions weren’t working either. Finally, at 11:55 pm, the bank confirmed that one of its data centers had suffered an outage. “Restoration shouldn’t take long,” it promised.
The cloud is driving enterprise digital transformation. Gartner predicts that by 2026, public cloud spending will exceed 45% of all enterprise IT spending, a 2.5x growth from 2021. Enterprises globally are accelerating application modernization and embracing the cloud. This is giving rise to a few key trends. For one, Software-as-a-Service (SaaS) adoption is on the rise, so organizations are using applications whose implementation and infrastructure they have little or no control over.
Alerts are notifications from AIOps monitoring tools that indicate an anomaly. IT teams receive these alerts on their monitoring dashboard, via email, or through enterprise collaboration tools such as Slack or Teams. Service level agreements expect IT teams to analyze every alert within a specific timeframe and take appropriate action.
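To make the service-level expectation concrete, here is a minimal Python sketch of checking whether an alert was handled inside its window. Everything in it is an assumption for illustration: the `Alert` fields, the severity names, and the time windows in `SLA_WINDOWS` are hypothetical and do not come from any particular monitoring tool or agreement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical SLA windows per severity; real agreements define their own.
SLA_WINDOWS = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=4),
}


@dataclass
class Alert:
    source: str                                # tool or system that raised the alert
    severity: str                              # must be a key of SLA_WINDOWS
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None  # None until someone picks it up


def sla_breached(alert: Alert, now: datetime) -> bool:
    """Return True if the alert was not acknowledged within its SLA window."""
    deadline = alert.raised_at + SLA_WINDOWS[alert.severity]
    # An unacknowledged alert is judged against the current time.
    handled_at = alert.acknowledged_at or now
    return handled_at > deadline
```

A team could run a check like this over its open alerts on a schedule and escalate any that return `True`.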
Alerts are indispensable to any IT operations system today. Site reliability engineers (SREs) or ITOps executives set up several monitoring tools across their IT landscape. When there is a change, high-risk action, or outage in any of the monitored systems, the monitoring tool triggers an automated alert. The alert can appear on the monitoring tool’s dashboard itself, or arrive via email or enterprise collaboration tools like Slack or Teams.
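As a rough illustration of how a monitoring tool might decide to trigger an alert, the Python sketch below flags a metric reading that strays too far from its recent baseline. The z-score rule, the threshold of 3.0, and the names `is_anomalous` and `build_alert` are assumptions chosen for the example; real tools use their own, usually more sophisticated, detection logic.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` standard deviations
    from the mean of `history` (a simple z-score rule, assumed here)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change counts as a deviation
    return abs(latest - mu) / sigma > z_threshold


def build_alert(metric, history, latest):
    """Return an alert payload (a plain dict) or None when nothing is wrong."""
    if not is_anomalous(history, latest):
        return None
    return {
        "metric": metric,
        "value": latest,
        "message": f"{metric} deviated from its recent baseline",
    }
```

The returned dictionary stands in for whatever payload would be pushed to a dashboard, an email, or a chat channel.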
The most effective way to understand an incident, resolve it, and prevent it from occurring again is root-cause analysis. Simply put, root-cause analysis is the study performed by ITOps teams or site reliability engineers (SREs) to pinpoint the exact element or error that caused the unexpected behavior. Based on this, they plan remediation. Accurate and timely root-cause analysis can have a direct impact on the company’s top and bottom lines.