Icinga Camp Berlin 2019 | System Diagnostics: A Deeper Understanding by Francesco Melchiori
The performance of ICT services (e.g. popular ERP network applications) fluctuates over time: services can be affected by latencies or downtimes. This is due to a variety of factors, such as the number of user connections, the quality of software optimization, and the balancing of hardware resources.
Monitoring systems collect a large number and variety of performance metrics from the entire computer network that provides these services. As a result, even experts find it difficult to map general malfunctions to precise causes in a short time.
We are developing machine learning pipelines that detect when a system is in an anomalous state and diagnose the probable sources of the outliers. The attention of an ICT technician is thus quickly directed towards a singular or chronic problem, so that a solution can be planned sooner.
We present our current unsupervised approach, applied to a real scenario, and offer a glimpse of the future: human and robotic supervision for reinforcement learning.
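
To make the detect-then-diagnose idea above more concrete, here is a minimal unsupervised sketch in Python. It is not the pipeline presented in the talk, whose details the abstract does not give. It assumes a simple robust z-score (median and median absolute deviation) per metric; the metric names, the sample values, the threshold of 3.5, and the functions `robust_scores` and `diagnose` are all hypothetical and chosen for illustration only.

```python
from statistics import median


def robust_scores(history, sample):
    """Score each metric of `sample` against its own history.

    Assumption: a robust z-score based on the median and the median
    absolute deviation (MAD). This is an illustrative stand-in, not the
    method used by the authors.
    """
    scores = {}
    for name, values in history.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)
        # 1.4826 rescales the MAD to match the standard deviation of a
        # normal distribution; fall back to a tiny scale if the MAD is 0.
        scale = 1.4826 * mad or 1e-9
        scores[name] = abs(sample[name] - med) / scale
    return scores


def diagnose(history, sample, threshold=3.5):
    """Return (is_anomalous, metrics ranked from most to least unusual)."""
    scores = robust_scores(history, sample)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # The system is flagged as soon as its most unusual metric
    # exceeds the (hypothetical) threshold.
    is_anomalous = ranked[0][1] > threshold
    return is_anomalous, ranked


# Hypothetical monitoring history: metric name -> recent values.
history = {
    "cpu_load": [0.40, 0.42, 0.38, 0.41, 0.39],
    "db_latency_ms": [12, 11, 13, 14, 10],
    "connections": [100, 105, 98, 102, 101],
}

# A new observation in which only the database latency is unusual.
sample = {"cpu_load": 0.41, "db_latency_ms": 48, "connections": 103}

is_anomalous, ranked = diagnose(history, sample)
print("anomalous:", is_anomalous)
print("most suspicious metric:", ranked[0][0])
```

The ranked list is what would steer a technician towards a probable source: the metric at the top deviates most from its own recent behaviour. A real deployment would have to deal with seasonality, correlated metrics, and far more signals than this sketch covers.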