DevConf.US 2025

Fast, Cheap, and Accurate: Optimizing LLM Inference with vLLM and Quantization
2025-09-19, 107 (Capacity 20)

Large language models are powerful, but they're also resource-intensive. Running them in production can be prohibitively expensive without the right tooling and optimizations. That's where vLLM and quantization come in: together, they offer a practical path to serving models at high speed and low cost, even on modest hardware.

In this workshop, you'll learn how to combine vLLM's high-performance serving engine with quantized models. Whether you're deploying to GPU servers in the cloud or to smaller-scale on-prem environments, you'll leave with the skills to drastically reduce inference latency and memory usage with minimal impact on output accuracy.

You’ll learn how to:
- Deploy quantized LLMs using vLLM’s OpenAI-compatible API
- Choose the right quantization formats for your hardware and use case
- Use tools like llm-compressor to generate optimized models
- Benchmark and compare performance across different quantization settings
- Tune vLLM configurations for throughput, latency, and memory efficiency
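To give a flavor of the first of these steps, serving a pre-quantized model through vLLM's OpenAI-compatible API can be as simple as the following sketch. The checkpoint name and port are illustrative, not prescriptive; any quantized model supported by your hardware works, and vLLM typically auto-detects the quantization format from the model's config.

```shell
# Launch vLLM's OpenAI-compatible HTTP server with an illustrative
# pre-quantized (INT4 weights, FP16 activations) checkpoint.
vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
  --port 8000

# Query it with the standard OpenAI chat completions API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Because the endpoint speaks the OpenAI API, existing OpenAI client code can point at it by changing only the base URL.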

By the end of the session, you will know how to deploy a quantized model on vLLM and apply these optimizations to your own production generative AI stack.
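Producing your own quantized model with llm-compressor follows a similar one-shot pattern. The sketch below assumes a recent llm-compressor release (the import paths and `oneshot` keyword arguments have shifted between versions) and uses illustrative model, dataset, and output names; calibration requires a GPU with enough memory for the base model.

```python
# Hedged sketch: one-shot W4A16 (GPTQ) quantization with llm-compressor.
# Model name, dataset, and output_dir are examples, not requirements.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize all Linear layers to 4-bit weights, keeping the lm_head in
# full precision to preserve output quality.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dataset="open_platypus",          # calibration data for GPTQ
    recipe=recipe,
    output_dir="llama-3.1-8b-w4a16",  # ready to pass to `vllm serve`
)
```

The resulting directory is a standard Hugging Face checkpoint, so it can be loaded directly by vLLM.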


What level of experience should the audience have to best understand your session?

Intermediate - attendees should be familiar with the subject

Legare Kerrison is a Developer Advocate on Red Hat's AI team, focused on open source tools for building and deploying AI. Currently, she works with projects like InstructLab, which simplifies fine-tuning large language models; vLLM, a high-throughput inference engine; and Podman Desktop, which supports containerized and Kubernetes-based workflows. She is based in Boston.

Taylor Smith, Senior Developer Advocate at Red Hat, is an advocate for open source AI innovation and democratization. She has a background in software development and works with open source technologies like Kubernetes and Linux. Taylor loves music, animals, and using technology to solve real-world problems. She is based in North Carolina.