DevConf.US 2025

Auto-tuning vLLM
2025-09-19, Hewitt Boardroom (Capacity 35)

My auto-tuning project aims to find the best settings for running large language models with vLLM. The goal is to maximize throughput (output tokens per second) while keeping latency in check: specifically, the p95 latency must stay at or below the baseline set by the default parameters. This involves testing different parameter configurations for supported models such as Qwen3-32B-FP8 and Qwen3-30B-A3B-FP8.
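As a minimal sketch of that objective (not the project's actual tool), the search can be framed as a sweep that keeps the highest-throughput configuration whose p95 latency does not regress past the default baseline. The `benchmark` callable below is a hypothetical helper that would launch vLLM with a given configuration and measure throughput and p95 latency; the parameter names mirror common vLLM engine arguments, but the specific grid is an illustrative assumption, not the real search space.

import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Result:
    throughput_tok_s: float  # measured output tokens per second
    p95_latency_s: float     # measured 95th-percentile request latency, in seconds


def tune(benchmark: Callable[[Dict[str, Any]], Result]) -> Dict[str, Any]:
    """Exhaustive sweep: return the highest-throughput config whose p95
    latency does not regress past the default-parameter baseline."""
    # Candidate values: names mirror common vLLM engine arguments, but this
    # particular grid is only an assumption for illustration.
    search_space = {
        "max_num_seqs": [64, 128, 256],
        "max_num_batched_tokens": [4096, 8192, 16384],
        "gpu_memory_utilization": [0.85, 0.90, 0.95],
    }

    baseline = benchmark({})  # default parameters define the latency budget
    best_config: Dict[str, Any] = {}
    best_throughput = baseline.throughput_tok_s

    for values in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        result = benchmark(config)
        # Constraint: p95 latency must be no worse than the baseline;
        # among the configs that satisfy it, keep the fastest one.
        if (result.p95_latency_s <= baseline.p95_latency_s
                and result.throughput_tok_s > best_throughput):
            best_config, best_throughput = config, result.throughput_tok_s

    return best_config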


What level of experience should the audience have to best understand your session?

Beginner - no experience needed

See also:

At Red Hat, I was a SWE intern on the Performance and Scale for AI Platforms (PSAP) team.