DevConf.US 2025

Auto-tuning vLLM
2025-09-19, Hewitt Boardroom (Capacity 35)

My auto-tuning project aims to find the best settings for running large language models with vLLM. The goal is to maximize throughput (output tokens per second) while keeping latency in check: specifically, the p95 latency must stay at or below the baseline set by the default parameters. This involves testing different parameter configurations for supported models such as Qwen3-32B-FP8 and Qwen3-30B-A3B-FP8.
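As a minimal sketch of that objective (not the project's actual tool), the search can be framed as a sweep that keeps the highest-throughput configuration whose p95 latency does not regress past the default baseline. The `benchmark` callable below is a hypothetical helper that would launch vLLM with a given configuration and measure throughput and p95 latency; the parameter names mirror common vLLM engine arguments, but the specific grid is an illustrative assumption, not the real search space.

import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Result:
    throughput_tok_s: float  # measured output tokens per second
    p95_latency_s: float     # measured 95th-percentile request latency, in seconds


def tune(benchmark: Callable[[Dict[str, Any]], Result]) -> Dict[str, Any]:
    """Exhaustive sweep: return the highest-throughput config whose p95
    latency does not regress past the default-parameter baseline."""
    # Candidate values: names mirror common vLLM engine arguments, but this
    # particular grid is only an assumption for illustration.
    search_space = {
        "max_num_seqs": [64, 128, 256],
        "max_num_batched_tokens": [4096, 8192, 16384],
        "gpu_memory_utilization": [0.85, 0.90, 0.95],
    }

    baseline = benchmark({})  # default parameters define the latency budget
    best_config: Dict[str, Any] = {}
    best_throughput = baseline.throughput_tok_s

    for values in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        result = benchmark(config)
        # Constraint: p95 latency must be no worse than the baseline;
        # among the configs that satisfy it, keep the fastest one.
        if (result.p95_latency_s <= baseline.p95_latency_s
                and result.throughput_tok_s > best_throughput):
            best_config, best_throughput = config, result.throughput_tok_s

    return best_config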


What level of experience should the audience have to best understand your session?

Beginner - no experience needed

See also:

At Red Hat, I was a SWE intern on the Performance and Scale for AI Platforms (PSAP) team.