AIOps

The latest News and Information on AIOps, alerting in complex systems and related technologies.

What is Mean Time Between Failures - and why does it matter for service availability

Mean Time Between Failures (MTBF) measures the average duration between repairable failures of a system or product. MTBF helps us anticipate how likely a system, application or service is to fail within a specific period or how often a particular type of failure may occur. In short, MTBF is a vital incident metric that indicates product or service availability (i.e. uptime) and reliability.
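
As a rough illustration of how MTBF relates to availability, the sketch below computes MTBF as operating time divided by the number of failures, and steady-state availability as MTBF / (MTBF + MTTR). The function names and the sample figures (a 720-hour window, 3 failures, 6 hours of total downtime) are assumptions made up for this example, not data from the article.

```python
def mtbf_hours(uptime_hours: float, failure_count: int) -> float:
    """Average operating time between repairable failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return uptime_hours / failure_count


def mttr_hours(downtime_hours: float, failure_count: int) -> float:
    """Average time to repair, used here only to derive availability."""
    if failure_count <= 0:
        raise ValueError("MTTR is undefined without at least one failure")
    return downtime_hours / failure_count


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf / (mtbf + mttr)


# Hypothetical figures: a 30-day (720 h) window with 3 failures
# and 6 hours of total downtime.
window_hours = 720.0
downtime = 6.0
failures = 3

mtbf = mtbf_hours(window_hours - downtime, failures)  # 714 / 3
mttr = mttr_hours(downtime, failures)                 # 6 / 3
print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.1f} h, "
      f"availability: {availability(mtbf, mttr):.4%}")
```

With these assumed numbers the service fails on average every 238 hours of operation, and availability equals uptime divided by the full window (714 / 720). Raising MTBF or lowering MTTR both push availability up.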

Accelerated Remediations: How to Maximize AIOps Investments in Network Operations

So, you’ve spent some money and you’re the proud owner of a shiny new AIOps tool that helps improve your Network Operations. Network alarms are now usable, but with all the constant monitoring, supervision, and incident management, your Network Operations Center (NOC) is still overwhelmed. It’s time to pull out another stop.

Generative AI for IT Operations: Your Questions Answered

IT leaders are thrilled about the potential of Generative AI for IT Operations. But they also want to know how it works, why it works, and what it will do for them before taking the leap and adopting this new technology. Allow me to share my perspective on the hype and the truth behind Generative AI. I’m the Field CTO for BigPanda, Operational Intelligence and Automation driven by AIOps.

Accelerate change alert discovery and incident resolution with Root Cause Changes

Today, the majority of organizations operate under a hybrid cloud structure. As a result, operations teams face daily infrastructure and software changes and updates, which are also the primary cause of incidents and outages. Long gone are the days when a tech stack could be represented by a single dependency model. Microservices, CI/CD, and containers across multi-cloud make it extremely difficult to track all the changes and connect them to incidents.

Why automated Root Cause Analysis matters for driving down MTTR

Finding the root causes of IT anomalies can be challenging, but the rewards are worth it. By identifying the root cause or causes of an incident or critical failure, response teams can resolve incidents faster and determine the best steps to avoid having them recur. This can drive down both the frequency of service interruptions and their duration.

The Evolution of IT Monitoring

Zenoss Chief Product Officer Trent Fitz recently spoke with Dan Turchin, host of the podcast “AI and the Future of Work,” and shared some insightful perspectives on the evolution of monitoring in the IT industry, the role of AIOps tools, and the challenges of moving to the cloud. They also discussed Trent’s extensive background in computer engineering and his experience driving product innovation and strategy in various technology fields.

Machine Learning for Fast and Accurate Root Cause Analysis

Machine Learning (ML) for Root Cause Analysis (RCA) is the state-of-the-art application of algorithms and statistical models to identify the underlying reasons for issues within a system or process. Rather than relying solely on human intervention or time-consuming manual investigations, ML automates and enhances the process of identifying the root cause.