DevConf.CZ 2025

Predicting Faulty Validations in Cluster Issue Detection: A Machine Learning Approach
2025-06-12, D105 (capacity 300)

Our team maintains a large-scale codebase for detecting and predicting issues in clusters, with hundreds of validation rules contributed and regularly updated by site reliability engineers (SREs). A key challenge is identifying validations that produce false positives and predicting which ones are most likely to require fixes.

In this research, we analyze validation rules as repeated code patterns, creating a unique dataset for machine learning. We compute numerical descriptors—such as code length, complexity, entropy, and time since introduction—across different Git branches and compare them with historical bug fixes. Preliminary results indicate strong correlations between these factors and validation reliability.
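
As a rough illustration of the kind of numerical descriptors mentioned above, the following minimal Python sketch computes a few features for a single rule's source text. It is not the team's actual pipeline: the function names, the character-level entropy, and the keyword-counting complexity proxy are all assumptions made for this example.

```python
import math
import re
from collections import Counter
from datetime import date


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical complexity proxy: one plus the number of branching keywords.
# A real study might use a proper cyclomatic-complexity tool instead.
BRANCH_PATTERN = re.compile(r"\b(if|elif|for|while|and|or|except|case)\b")


def describe_rule(source: str, introduced: date, today: date) -> dict:
    """Return a small feature dictionary for one validation rule's source."""
    non_blank_lines = [ln for ln in source.splitlines() if ln.strip()]
    return {
        "n_lines": len(non_blank_lines),
        "n_chars": len(source),
        "entropy_bits": shannon_entropy(source),
        "branch_complexity": 1 + len(BRANCH_PATTERN.findall(source)),
        "age_days": (today - introduced).days,
    }


if __name__ == "__main__":
    # Illustrative rule text only; not taken from any real codebase.
    rule = "if node.disk_usage > 0.9:\n    report('disk almost full')\n"
    print(describe_rule(rule, date(2024, 1, 1), date(2025, 6, 12)))
```

Feature rows like these, computed per rule and per Git branch, could then be joined with a label such as "was later changed by a bug fix" to form a training table.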

In this talk, we will present our findings using classical machine learning models and benchmark them against modern large language models (LLMs). We will discuss the effectiveness of both approaches and their potential to improve validation quality automatically.


What level of experience should the audience have to best understand your session?

Beginner - no experience needed

Liat Pele is a Development Team Leader at Red Hat, working on troubleshooting and validation tools for cloud infrastructure. She has been involved in developing system validation and automation tools to support the SRE group.

She previously contributed to research initiatives such as the Horizon 2020 NGPaaS project and holds a Ph.D. in Computational Chemistry from the Hebrew University of Jerusalem.

At DevConf.CZ 2025, she will share insights on using machine learning and LLMs to improve automated cluster validation and reduce false positives.

Dr. Ofir Pele is a data scientist and algorithms developer. He previously led ML initiatives at Western Digital that delivered multi-million-dollar impact through innovations operating under strict reliability requirements. His research background includes positions at Ariel University and the University of Pennsylvania; his publications have garnered over 2,500 citations, and he has 12 patents. Dr. Pele holds a Ph.D. in Computer Science from The Hebrew University of Jerusalem, and his work on distance functions has been adopted by industry and academic research groups. His expertise spans explainable and constrained ML, deep learning (DL), reinforcement learning (RL), computer vision (CV), and large-scale simulations and experiments, with applications in diverse domains including storage technology, cybersecurity, and healthcare.