The goal of the project was to detect anomalies in the sensor data of a deep learning development server cluster. The cluster consists of large-scale GPU servers and is used to process high volumes of data for training machine learning models and neural networks. Such training processes typically run for several hours, which makes the cluster a bottleneck resource. To gain more insight into the overall occupancy of the cluster and to be notified when large training processes are started, a monitoring and anomaly detection solution was needed. Major deviations from normal behavior should trigger automated alerts, sent by e-mail to the person in charge.
For each server, 58 features are recorded every second and stored in a separate CSV file per server. The sensors register several types of data for the different hardware components. Ranging from capacity and temperature to clock speeds and power consumption, they cover all important aspects needed to evaluate current workloads and occupancy rates.
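To make the data layout concrete, the following minimal sketch reads a few rows in the per-server, one-row-per-second CSV shape described above and averages each sensor column. The column names and values are hypothetical, chosen only for illustration; the real files contain 58 features.

```python
import csv
import io
from statistics import mean

# Hypothetical excerpt of one server's file: one row per second.
# Column names and values are assumptions; the real files hold 58 features.
SAMPLE = """timestamp,gpu0_temp_c,gpu0_util_pct,cpu_util_pct
2018-03-01T10:00:00,61,97,35
2018-03-01T10:00:01,62,98,36
2018-03-01T10:00:02,63,99,34
"""

def summarize(csv_text):
    """Return the mean of every sensor column, keyed by column name."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = [name for name in rows[0] if name != "timestamp"]
    return {name: mean(float(row[name]) for row in rows) for name in columns}

summary = summarize(SAMPLE)
print(summary)
```

A per-column summary like this is only a quick sanity check on the raw files; it does not replace the trend-aware analysis described in the following section.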
Challenges & Solutions
The machine learning model needs to recognize and incorporate trends and to notice unexpected trend changes or peaks. In other words, the model needs to adjust itself over time. The chosen analytics tool is the highly scalable Elastic Stack. With Logstash, a real-time pipeline was built that stores the most current data from the log files directly in an Elasticsearch cluster. Kibana, extended with X-Pack, was set up as the user interface.
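The requirement that the model adjust itself over time can be illustrated with a small, self-contained sketch. It is not the modelling used by the Elastic Stack; it is a simple exponentially weighted baseline, swapped in here only to show the idea of following slow trends while still flagging sudden peaks. The class name, parameters, and synthetic readings are all assumptions.

```python
import math

class AdaptiveBaseline:
    """Exponentially weighted mean and variance that follow slow trends."""

    def __init__(self, alpha=0.05, threshold=4.0, warmup=30):
        self.alpha = alpha          # how quickly the baseline adapts
        self.threshold = threshold  # deviation (in std devs) counted as anomalous
        self.warmup = warmup        # samples to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, value):
        """Score the value against the current baseline, then learn from it."""
        self.count += 1
        if self.mean is None:
            self.mean = value
            return False
        deviation = value - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        # Standard exponentially weighted updates for mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly

# Synthetic temperature: slow upward trend with small ripple, then a sudden peak.
readings = [60 + 0.01 * i + (0.5 if i % 2 else -0.5) for i in range(200)]
readings.append(90.0)

detector = AdaptiveBaseline()
flags = [detector.update(value) for value in readings]
print(sum(flags), flags[-1])
```

Because the baseline keeps updating, the gradual rise in the synthetic readings is absorbed into the expected level, while the final jump stands out against it.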
Direct network access to the log file directories was configured, so the most current sensor data is ready to be analyzed with less than a second of delay.
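Keeping the delay this low depends on shipping only the rows that were appended since the last read. The sketch below shows that idea in plain Python by remembering a byte offset per file. It illustrates the principle only and is not how Logstash is implemented; the function name and file contents are hypothetical.

```python
import os
import tempfile

def read_new_lines(path, offset):
    """Return (complete lines appended since offset, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Consume only up to the last newline so a half-written row is left for later.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "server01.csv")  # hypothetical file name
    with open(path, "w", newline="") as f:
        f.write("t,temp\n1,60\n")
    first, pos = read_new_lines(path, 0)

    with open(path, "a", newline="") as f:
        f.write("2,61\n3,6")  # the last row is still being written
    second, pos = read_new_lines(path, pos)

print(first, second, pos)
```

Leaving the incomplete final row unread avoids indexing a truncated record when the sensor logger is in the middle of a write.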
The Machine Learning module was configured to monitor several critical sensors and to infer trends in the data. It analyzes temperature, clock speeds and capacities for the CPU, GPU, RAM and storage components. For exceptional events on the server cluster, e-mail alerts were configured for different severity thresholds.
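As a rough sketch of what such a configuration might look like, the code below builds a request body in the shape accepted by Elasticsearch's create anomaly detection job API and maps anomaly scores to severity bands. The field names (`gpu_temp_c`, `host`, and so on), the bucket span, and the e-mail cut-off are assumptions rather than the project's actual settings, and the severity labels follow the commonly documented 25/50/75 score thresholds, which may be named differently depending on the Kibana version.

```python
import json

def build_job(metric_fields, bucket_span="5m"):
    """Build a request body for an anomaly detection job (illustrative values)."""
    return {
        "description": "GPU cluster sensor anomalies (illustrative)",
        "analysis_config": {
            "bucket_span": bucket_span,
            # One detector per sensor; partitioning models each server separately.
            "detectors": [
                {
                    "function": "mean",
                    "field_name": field,
                    "partition_field_name": "host",  # assumed field name
                }
                for field in metric_fields
            ],
            "influencers": ["host"],
        },
        "data_description": {"time_field": "@timestamp"},
    }

def severity(score):
    """Map a 0-100 anomaly score to a severity band."""
    if score >= 75:
        return "critical"
    if score >= 50:
        return "major"
    if score >= 25:
        return "minor"
    return "warning"

def should_email(score, min_severity="major"):
    """Decide whether a score is severe enough to notify (assumed cut-off)."""
    order = ["warning", "minor", "major", "critical"]
    return order.index(severity(score)) >= order.index(min_severity)

# Hypothetical sensor fields; the real index would use the recorded feature names.
job = build_job(["gpu_temp_c", "gpu_clock_mhz", "ram_used_pct"])
print(json.dumps(job, indent=2))
print(severity(82), should_email(82), should_email(30))
```

Partitioning by server keeps an unusually busy machine from distorting the baseline of the others, and a separate cut-off for e-mails lets lower-severity anomalies remain visible in the interface without generating mail.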
A real-time monitoring tool with a self-learning anomaly detection model was implemented. It is configured to automatically send e-mails when certain severity thresholds are exceeded for several hardware components. As a result, the utilization of the deep learning server cluster can be managed more effectively.