BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.devconf.info//YFEMGN
BEGIN:VTIMEZONE
TZID:EST
BEGIN:STANDARD
DTSTART:20000101T000000
RRULE:FREQ=YEARLY;BYMONTH=1;UNTIL=20050101T050000Z
TZNAME:EST
TZOFFSETFROM:-0500
TZOFFSETTO:-0500
END:STANDARD
BEGIN:STANDARD
DTSTART:20061029T030000
RRULE:FREQ=YEARLY;BYDAY=5SU;BYMONTH=10;UNTIL=20061029T070000Z
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:STANDARD
DTSTART:20071104T030000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20060402T030000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4;UNTIL=20060402T080000Z
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
BEGIN:DAYLIGHT
DTSTART:20070311T030000
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-devconf-us-2025-YFEMGN@pretalx.devconf.info
DTSTART;TZID=EST:20250920T110000
DTEND;TZID=EST:20250920T113500
DESCRIPTION:Large Language Models (LLMs) require preprocessing vast
  amounts of data\, often petabytes\, a process that can span days due to
  its complexity and scale. This talk demonstrates how Kubeflow Pipe
 lines (KFP) simplify LLM data processing with flexibility\, repeatability\
 , and scalability. These pipelines are being used daily at IBM Research to
  build indemnified LLMs tailored for enterprise applications.\nDifferent d
 ata preparation toolkits are built on Kubernetes\, Rust\, Slurm\, or Spark
 . How would you choose one for your own LLM experiments or enterprise use 
 cases\, and why should you consider Kubernetes and KFP?\nThis talk
  describes how the open source Data Prep Toolkit leverages KFP and
  KubeRay for scalable pipeline orchestration of steps such as
  deduplication\, content classification\, and tokenization.\nWe share
  challenges\, lessons\, and insights from our expe
 rience with KFP\, highlighting its applicability for diverse LLM tasks\, s
 uch as data preprocessing\, RAG retrieval\, and model fine-tuning.
DTSTAMP:20250705T131636Z
LOCATION:Ladd Room (Capacity 96)
SUMMARY:Generative AI Model Data Pre-Training on Kubernetes: A Use Case Stu
 dy - Humair Khan\, Santosh Borse
URL:https://pretalx.devconf.info/devconf-us-2025/talk/YFEMGN/
END:VEVENT
END:VCALENDAR