Rehan Samaratunga
At Red Hat, I was a software engineering intern on the Performance and Scale for AI Platforms (PSAP) team.
Student
Company or affiliation – Boston University
Session
09-19
11:20
10 min
Auto-tuning vLLM
Rehan Samaratunga
My auto-tuning project aims to find the best settings for serving large language models with vLLM. The goal is to maximize throughput (output tokens per second) while controlling latency: specifically, a configuration is only accepted if its p95 latency beats the baseline measured with the default parameters. This involves testing different parameter configurations for supported models such as Qwen3-32B-FP8 and Qwen3-30B-A3B-FP8.
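The selection rule described in the abstract (maximize throughput subject to a p95 latency constraint) can be sketched in a few lines of Python. All configuration names and measurements below are invented for illustration; the real project would obtain them from benchmark runs against a vLLM server.

```python
def best_config(results, baseline_p95_ms):
    """Pick the highest-throughput configuration whose p95 latency
    beats the default-parameter baseline.

    results: list of dicts with keys 'name', 'throughput_tps', 'p95_ms'.
    Returns the winning dict, or None if nothing beats the baseline.
    """
    # Keep only configurations faster (at p95) than the baseline.
    feasible = [r for r in results if r["p95_ms"] < baseline_p95_ms]
    if not feasible:
        return None
    # Among feasible configurations, maximize tokens per second.
    return max(feasible, key=lambda r: r["throughput_tps"])

# Hypothetical benchmark results for three configurations.
trials = [
    {"name": "default", "throughput_tps": 1200, "p95_ms": 850},
    {"name": "cfg-a",   "throughput_tps": 1500, "p95_ms": 900},  # violates latency
    {"name": "cfg-b",   "throughput_tps": 1400, "p95_ms": 800},  # feasible winner
]
print(best_config(trials, baseline_p95_ms=850)["name"])  # cfg-b
```

The key design choice is treating latency as a hard constraint rather than folding it into a single score, so a high-throughput configuration that degrades tail latency is never selected.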
Intern Showcase
Hewitt Boardroom (Capacity 35)