DevConf.US 2025

Taylor Smith

Taylor Smith, Senior Developer Advocate at Red Hat, advocates for open source AI innovation and democratization. She has a background in software development and experience with open source technologies like Kubernetes and Linux. Taylor loves music, animals, and helping to solve real-world problems with technology. She is based in North Carolina.


Job title: Senior Developer Advocate

Company or affiliation: Red Hat


Session

09-19, 14:40 (80 min)
Fast, Cheap, and Accurate: Optimizing LLM Inference with vLLM and Quantization
Legare Kerrison, Taylor Smith

Large language models are powerful, but they’re also resource-intensive. Running them in production can be prohibitively expensive without the right tooling and optimizations. That’s where vLLM and quantization come in: together, they offer a practical path to serving models at high speed and low cost, even on modest hardware.

In this workshop, you’ll learn how to combine vLLM’s high-performance serving engine with quantized models. Whether you're deploying to GPU servers in the cloud or smaller-scale on-prem environments, you’ll leave with the skills to drastically reduce inference latency and memory usage—without compromising output accuracy.

You’ll learn how to:
- Deploy quantized LLMs using vLLM’s OpenAI-compatible API (first sketch below)
- Choose the right quantization formats for your hardware and use case
- Use tools like llm-compressor to generate optimized models (second sketch below)
- Benchmark and compare performance across different quantization settings
- Tune vLLM configurations for throughput, latency, and memory efficiency (third sketch below)
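
To give a flavor of the first bullet, here is a minimal sketch that queries a locally running vLLM server through the standard OpenAI Python client; the model ID, port, and launch flags are illustrative placeholders.

```python
# Minimal sketch: talk to a vLLM server through its OpenAI-compatible API.
# Assumes a server was started beforehand, e.g. with
#   vllm serve <quantized-model-id> --max-model-len 4096
# (the model ID and flags are placeholders).
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key can be any non-empty string unless the server enforces one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="<quantized-model-id>",  # must match the model the server was launched with
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```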
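
For the llm-compressor bullet, the rough sketch below follows the library’s one-shot quantization flow; the exact import paths, scheme name, and keyword arguments are assumptions and may differ between releases.

```python
# Rough sketch: one-shot FP8 quantization with llm-compressor.
# Import paths and arguments are assumptions and may vary by version.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to FP8 with dynamic activation scales,
# keeping the LM head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model ID
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8",   # compressed checkpoint vLLM can load
)
```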
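
For the tuning bullet, a handful of vLLM engine settings can be passed straight to its offline entry point; the values below are illustrative, not recommendations.

```python
# Sketch of vLLM engine settings that trade off throughput, latency, and memory.
# Values are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<quantized-model-id>",   # placeholder for a quantized checkpoint
    gpu_memory_utilization=0.90,    # fraction of GPU memory reserved for weights + KV cache
    max_model_len=4096,             # shorter context limit means a smaller KV cache
    max_num_seqs=128,               # upper bound on concurrently batched requests
    enable_prefix_caching=True,     # reuse KV cache across requests sharing a prefix
)

outputs = llm.generate(
    ["Explain weight quantization in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```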

By the end of the session, you will know how to deploy your own quantized model on vLLM and apply these optimizations to your production generative AI stack.

Artificial Intelligence and Data Science
107 (Capacity 20)