2025-06-13 –, D105 (capacity 300)
Large Language Models (LLM) require preprocessing vast amounts of data, a process that can span days due to its complexity and scale, often involving PetaBytes of data. This talk demonstrates how Kubeflow Pipelines (KFP) simplify LLM data processing with flexibility, repeatability, and scalability. These pipelines are being used daily at IBM Research to build indemnified LLMs tailored for enterprise applications.
Different data preparation toolkits are built on Kubernetes, Rust, Slurm, or Spark. How would you choose one for your own LLM experiments or enterprise use cases and why should you consider Kubernetes and KFP?
This talk describes how open source Data Prep Toolkit leverages KFP and KubeRay for scalable pipeline orchestration, e.g. deduplication, content classification, and tokenization.
We share challenges, lessons, and insights from our experience with KFP, highlighting its applicability for diverse LLM tasks, such as data preprocessing, RAG retrieval, and model fine-tuning.
Intermediate - attendees should be familiar with the subject
Anish is an engineering manager at Red Hat in the OpenShift AI organization. He is working on making machine learning easier for the wider community by building a platform out with cloud capabilities at the core. Most recently, his interests have been focused on the Training and Experimentation space via components in Kubeflow. He has previously been invested heavily in areas such as monitoring, scalability, and reliability.