2025-09-19 – Hewitt Boardroom (Capacity 35)
My auto-tuning project aims to find the best settings for serving large language models with vLLM. The goal is to maximize throughput (output tokens per second) while keeping latency in check: specifically, a configuration only counts if its p95 latency is at or below the baseline measured with the default parameters. This involves testing different parameter configurations for supported models such as Qwen3-32B-FP8 and Qwen3-30B-A3B-FP8. A small sketch of the selection rule follows below.
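In rough terms, the selection rule is a constrained maximization: among the candidate configurations, keep the one with the highest throughput whose p95 latency does not exceed the default-parameter baseline. A minimal Python sketch of that rule (the helper names and data shapes here are illustrative, not the project's actual code):

    # Hypothetical sketch of the tuning objective: pick the highest-throughput
    # configuration whose p95 latency stays at or below the default-parameter baseline.
    from dataclasses import dataclass

    @dataclass
    class RunResult:
        config: dict              # vLLM parameters tried for this run
        tokens_per_second: float  # measured output-token throughput
        p95_latency_s: float      # measured p95 request latency, in seconds

    def pick_best(results: list[RunResult], baseline_p95_s: float) -> RunResult | None:
        """Return the highest-throughput run that meets the latency constraint."""
        eligible = [r for r in results if r.p95_latency_s <= baseline_p95_s]
        if not eligible:
            return None  # no candidate beat the baseline latency
        return max(eligible, key=lambda r: r.tokens_per_second)

In practice the candidate configurations would come from whatever search strategy the project uses over vLLM's parameters; the sketch only shows how a winner is chosen once throughput and p95 latency have been measured for each run.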
Beginner - no experience needed
At Red Hat, I was a SWE intern on the Performance and Scale for AI Platforms (PSAP) team.