Santosh Borse
With over 20 years in the tech industry, Santosh Borse has grown from Junior Developer to Technical Lead, Team Manager, and Architect, contributing to a wide spectrum of impactful projects. He is driven by the challenge of solving complex problems, advancing through continuous learning, and applying technical innovation to make a real-world difference.
His guiding motto is simple: make the world a better place through software.
Currently, Santosh is part of the Data Engineering team at IBM Research, where he works on preparing pre- and post-training data, improving data quality, and post-training LLM models.
As an inventor, Santosh holds 11 granted patents spanning AI, natural language processing, drones, IoT, social media, data analytics, cloud computing, mobile, and speech processing. He also shares his learnings and insights with the broader community at https://medium.com/@sanborse.
Santosh holds a Master’s degree in Computer Science.
Senior Engineer
Company or affiliation – IBM Research
Session
Large language models (LLMs) require preprocessing vast amounts of data, a process that can span days because of its complexity and scale, often involving petabytes of data. This talk demonstrates how Kubeflow Pipelines (KFP) simplifies LLM data processing with flexibility, repeatability, and scalability. These pipelines are used daily at IBM Research to build indemnified LLMs tailored for enterprise applications.
Different data preparation toolkits are built on Kubernetes, Rust, Slurm, or Spark. How would you choose one for your own LLM experiments or enterprise use cases, and why should you consider Kubernetes and KFP?
This talk describes how the open source Data Prep Toolkit leverages KFP and KubeRay for scalable orchestration of pipeline steps such as deduplication, content classification, and tokenization.
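To illustrate the idea, here is a minimal sketch of how such steps could be chained with the KFP v2 Python SDK. This is not the Data Prep Toolkit's actual code; the component names and their bodies are hypothetical placeholders standing in for real deduplication, classification, and tokenization transforms.

```python
# Minimal sketch of an LLM data-preparation pipeline using the KFP v2 SDK.
# Component names and bodies are illustrative placeholders, not the
# Data Prep Toolkit's real implementation.
from kfp import dsl, compiler


@dsl.component(base_image="python:3.11")
def deduplicate(input_path: str, output_path: str):
    # Placeholder: exact or fuzzy deduplication of raw documents would go here.
    print(f"Deduplicating {input_path} -> {output_path}")


@dsl.component(base_image="python:3.11")
def classify_content(input_path: str, output_path: str):
    # Placeholder: label or filter documents by quality, language, or domain.
    print(f"Classifying {input_path} -> {output_path}")


@dsl.component(base_image="python:3.11")
def tokenize(input_path: str, output_path: str):
    # Placeholder: convert cleaned text into tokenized training shards.
    print(f"Tokenizing {input_path} -> {output_path}")


@dsl.pipeline(name="llm-data-prep")
def llm_data_prep(raw_data: str, prepared_data: str):
    # Chain the steps so each runs after the previous one completes.
    dedup = deduplicate(input_path=raw_data, output_path="deduped/")
    classified = classify_content(input_path="deduped/", output_path="classified/")
    classified.after(dedup)
    tok = tokenize(input_path="classified/", output_path=prepared_data)
    tok.after(classified)


if __name__ == "__main__":
    # Compile to a YAML spec that can be submitted to a KFP deployment.
    compiler.Compiler().compile(llm_data_prep, "llm_data_prep.yaml")
```

The value of expressing the pipeline this way is that each step becomes a versioned, containerized component that can be rerun, swapped, or scaled independently on Kubernetes.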
We share challenges, lessons, and insights from our experience with KFP, highlighting its applicability to diverse LLM tasks such as data preprocessing, retrieval for RAG, and model fine-tuning.