2026-02-13, VYAS - 1 - Room#VY124
Most Retrieval-Augmented Generation (RAG) systems fail long before the LLM comes into play. The real issue is not the model but the documents feeding it. Enterprise PDFs often have broken reading order, distorted tables, inconsistent formatting, embedded images, and scattered metadata. When this messy content enters the retrieval pipeline, even the strongest language model struggles, producing irrelevant answers or subtle hallucinations. This talk breaks down why RAG often collapses in real-world conditions and shows how open-source tools can turn a fragile workflow into something reliable.
The first part of the session introduces Docling, an open-source document processing toolkit that converts complex PDFs, Word files, presentations, images, and audio into clean and structured content. It preserves layout, hierarchy, tables, and multimodal elements so that your RAG pipeline finally receives high-quality input. The second part covers OpenSearch, a fully open and scalable engine for vector indexing, hybrid retrieval, and metadata-driven search. Together, these tools offer a practical foundation for building RAG systems that are accurate, explainable, and robust at enterprise scale.
We will walk through the overall architecture, key design patterns, and lessons learned from real implementations. To make the concepts concrete, the session will end with a short demo that takes a messy PDF, processes it through Docling, indexes it in OpenSearch, and queries it within a RAG workflow that consistently returns the right context. Attendees will leave with a clear understanding of why many RAG systems fail today and a practical roadmap for building reliable RAG applications using open-source technology.
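As a rough illustration of the demo flow described above, the sketch below covers only the glue between the two tools, and it is not the speaker's implementation. It assumes the PDF has already been converted to Markdown (for example with Docling's `DocumentConverter().convert(path).document.export_to_markdown()`), splits that Markdown into heading-scoped chunks, and builds the request bodies that an OpenSearch client would need for a k-NN index and a vector query. The function names, the field names (`heading`, `text`, `embedding`), and the choice of an embedding model are all hypothetical.

```python
import re


def chunk_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section.

    The heading is kept as metadata so that retrieval can filter or
    display it alongside the chunk text.
    """
    chunks = []
    heading = ""
    lines = []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section, if it has content.
            if any(l.strip() for l in lines):
                chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
            heading = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
    return chunks


def knn_index_body(dimension: int) -> dict:
    """Index settings and mappings for an OpenSearch k-NN index.

    Field names are illustrative; `dimension` must match the output size
    of whichever embedding model is used (not specified in the abstract).
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "heading": {"type": "keyword"},
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def knn_query_body(query_vector: list[float], k: int = 3) -> dict:
    """Search body that retrieves the k nearest chunks to the query vector."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }
```

With the `opensearch-py` client, these bodies would typically be passed as `client.indices.create(index=..., body=knn_index_body(dim))` and `client.search(index=..., body=knn_query_body(vec))`. Keeping the heading on every chunk is one simple way to carry document structure into retrieval, so that a table or paragraph is returned together with the section it belongs to.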
Anindita Sinha Banerjee is a Data Scientist at Red Hat and a former researcher at Tata Research. She has authored research presented at PAKDD and an ACL Workshop, has spoken at Open Source Summit, KubeCon, and PyCon India, and has contributed to 3 patents. She is an open-source enthusiast. Google Scholar: https://scholar.google.com/citations?user=5GCQcVkAAAAJ&hl=en&oi=ao
I am a Senior Data Scientist and Technical Architect specializing in building end-to-end AI and analytics solutions. My core expertise includes exploratory data analysis (EDA), statistical modeling, machine learning, forecasting, and designing scalable AI applications. I have strong hands-on experience with GenAI tools such as LangChain, OpenAI APIs, GPT-Vision, and Gemini LLM, along with vector databases and full CI/CD deployment pipelines.
I also bring deep experience in solving complex text-data problems, designing solutions with traditional NLP techniques as well as modern AI approaches such as LLMs and neural networks. At Red Hat, I focus on architecting reliable, production-ready AI systems that deliver meaningful business insights and support strategic decision-making.