Jakub Borsky
I am currently a Master's student of Artificial Intelligence and Data Processing at Masaryk University in Brno, where I am also a member of the sec-certs team. In addition to my interest in AI and ML, I love diving deep into how things truly work under the hood.
Session
Extracting structured information from PDFs is a challenging task; the format was designed for visual consistency, not machine readability. Rule-based tools handle basic text extraction well but struggle with tables, semantic role identification, and specialized content like math formulas. Modern ML-based tools are more versatile but can hallucinate, producing text that is not in the source document. Hybrid tools attempt to get the best of both worlds.
Docling is one such hybrid tool. It combines programmatic PDF parsing with additional ML models, producing a rich, structured document representation.
We integrated Docling into sec-certs, an open-source tool for automated analysis of Common Criteria and FIPS 140 certification documents, aiming to improve reliability and enable more sophisticated analysis.
This talk shares how structured output changes what's possible in automated analysis, how the pipeline improved, what worked (and what didn't), and lessons learned when processing large collections of security certification PDFs.