DevConf.US 2025

Aakanksha Duggal

Aakanksha Duggal is a Principal Data Scientist at Red Hat, leading synthetic data generation efforts on RHEL AI. Her work focuses on advancing scalable and impactful technologies in the field of AI.


Job title

Principal Data Scientist

Company or affiliation

Red Hat Inc


Session

09-19
11:20
35min
Leveraging Teacher Models for Efficient Synthetic Data Generation in LLMs
Aakanksha Duggal

Generating high-quality, domain-specific data for large language models (LLMs) is a significant challenge, particularly when creating datasets relevant for model customization and fine-tuning. In this session, the audience will learn how synthetic data generation techniques can address this challenge. The speaker will cover how third-party teacher models like Mixtral, Mistral, Phi-4, and LLaMA streamline the process, along with open-source tools like Docling, which breaks down complex knowledge documents into semantic chunks.

The session will also walk the audience through bringing their own teacher models into the synthetic data generation (SDG) workflow, which enables the creation of higher-quality samples for model customization. Attendees will learn how to build modular, flexible workflows without any coding skills, making it easy to scale data generation tasks.

The session will demonstrate how to efficiently process large volumes of knowledge data and generate high-quality samples using an open-source, cost-effective approach to building production-ready LLMs, without relying on extensive manual annotation.

Artificial Intelligence and Data Science
Ladd Room (Capacity 96)