DevConf.US 2025

Leveraging Teacher Models for Efficient Synthetic Data Generation in LLMs
2025-09-19, Ladd Room (Capacity 96)

Generating high-quality, domain-specific data for large language models (LLMs) is a significant challenge, particularly when creating datasets for model customization and fine-tuning. In this session, the audience will learn how synthetic data generation (SDG) techniques can address this challenge. The speaker will cover how third-party teacher models such as Mixtral, Mistral, Phi-4, and LLaMA streamline the process, and how open-source tools such as Docling break complex knowledge documents into semantic chunks.
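
As a concrete illustration of the chunking step, here is a minimal sketch using Docling's document converter and hybrid chunker; the input file name is a hypothetical placeholder, and the chunker's defaults are assumed rather than prescribed by the session.

```python
# Minimal sketch: parse a document with Docling and split it into
# semantic chunks. Requires `pip install docling`; the file name
# "knowledge_doc.pdf" is a hypothetical placeholder.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Convert a source document (PDF, DOCX, HTML, ...) into Docling's
# unified DoclingDocument representation.
converter = DocumentConverter()
result = converter.convert("knowledge_doc.pdf")

# Split the parsed document into semantic chunks suitable for
# feeding a teacher model during synthetic data generation.
chunker = HybridChunker()
for chunk in chunker.chunk(result.document):
    print(chunk.text[:80])  # preview the first 80 characters of each chunk
```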

Additionally, the session will show attendees how to bring their own teacher models into the SDG workflow, enabling the creation of higher-quality samples for model customization. Attendees will also learn how to build modular, flexible workflows without writing any code, making it easy to scale data generation tasks.
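
To make the bring-your-own-teacher idea concrete, the sketch below queries a self-hosted teacher model through an OpenAI-compatible endpoint (the kind of server vLLM exposes). The base URL, model name, and prompt are illustrative assumptions, not the session's fixed workflow.

```python
# Hedged sketch: generate a synthetic QA sample from a document chunk
# using any teacher model served behind an OpenAI-compatible endpoint.
# The endpoint URL, model name, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chunk_text = "Docling breaks complex knowledge documents into semantic chunks."

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # swap in your own teacher
    messages=[
        {"role": "system",
         "content": "You write grounded question-answer pairs for fine-tuning."},
        {"role": "user",
         "content": f"Write one question-answer pair grounded in:\n{chunk_text}"},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```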

The session will demonstrate how to efficiently process large volumes of knowledge data and generate high-quality samples using an open-source, cost-effective approach to building production-ready LLMs, without relying on extensive manual annotation.
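
Putting the pieces together, an end-to-end loop of the following kind is one plausible shape for the scaled-up pipeline: chunk every document in a corpus directory and write one teacher-generated sample per chunk to a JSONL file. The directory name, output format, and generate_qa placeholder are assumptions for illustration, not the session's exact workflow.

```python
# Illustrative end-to-end loop: chunk every document in a (hypothetical)
# corpus directory and write one teacher-generated sample per chunk to a
# JSONL file. generate_qa is a placeholder for the teacher-model call
# sketched above.
import json
from pathlib import Path

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

def generate_qa(chunk_text: str) -> str:
    # Placeholder: call your teacher model here (see the earlier sketch).
    raise NotImplementedError

converter = DocumentConverter()
chunker = HybridChunker()

with open("sdg_samples.jsonl", "w") as out:
    for doc_path in sorted(Path("knowledge_docs").glob("*.pdf")):
        result = converter.convert(str(doc_path))
        for chunk in chunker.chunk(result.document):
            record = {"source": doc_path.name, "sample": generate_qa(chunk.text)}
            out.write(json.dumps(record) + "\n")
```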


What level of experience should the audience have to best understand your session?

Intermediate - attendees should be familiar with the subject

Aakanksha Duggal is a Principal Data Scientist at Red Hat, leading synthetic data generation efforts on RHEL AI. Her work focuses on advancing scalable and impactful technologies in the field of AI.