Taylor Smith
Taylor Smith, Senior Developer Advocate at Red Hat, is an advocate for open source AI innovation and democratization. She has a background in software development and in working with open source technologies such as Kubernetes and Linux. Taylor loves music, animals, and using technology to solve real-world problems. She is based in North Carolina.
Senior Developer Advocate
Company or affiliation – Red Hat
Session
Large language models are powerful, but they are also resource-intensive. Running them in production can be prohibitively expensive without the right tooling and optimizations. That's where vLLM and quantization come in: together, they offer a practical path to serving models at high speed and low cost, even on modest hardware.
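To make the resource pressure concrete, here is a rough back-of-envelope sketch of weight memory alone. The 7B parameter count and byte widths are illustrative assumptions, and real deployments also need memory for activations and the KV cache, so treat this as a lower bound rather than a sizing guide:

```python
# Rough weight-memory estimate for a hypothetical 7B-parameter model.
# Ignores activations, KV cache, and framework overhead.

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(7e9, 16)  # unquantized 16-bit weights
int4 = weight_memory_gb(7e9, 4)   # 4-bit quantized weights

print(f"fp16: {fp16:.1f} GB, int4: {int4:.1f} GB")
# → fp16: 14.0 GB, int4: 3.5 GB
```

A 4x reduction like this is what moves a model from "needs a large datacenter GPU" to "fits on modest hardware."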
In this workshop, you'll learn how to combine vLLM's high-performance serving engine with quantized models. Whether you're deploying to GPU servers in the cloud or to smaller-scale on-prem environments, you'll leave with the skills to drastically reduce inference latency and memory usage without compromising output accuracy.
You’ll learn how to:
- Deploy quantized LLMs using vLLM’s OpenAI-compatible API
- Choose the right quantization formats for your hardware and use case
- Use tools like llm-compressor to generate optimized models
- Benchmark and compare performance across different quantization settings
- Tune vLLM configurations for throughput, latency, and memory efficiency
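As a preview of the first bullet, here is a minimal sketch of talking to a vLLM server through its OpenAI-compatible REST API, using only the Python standard library. The server address and model name are illustrative assumptions, as is the premise that a server is already running (e.g. started with `vllm serve <model>`); this is not the workshop's exact setup.

```python
import json
import urllib.request

# Illustrative assumptions: a vLLM server is already listening on
# localhost:8000, and MODEL names whatever quantized checkpoint that
# server was started with (placeholder name, not a real model).
BASE_URL = "http://localhost:8000/v1"
MODEL = "my-org/my-quantized-model"

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(MODEL, prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("In one sentence, what does quantization do?"))
```

Because the schema matches OpenAI's chat completions API, the official `openai` Python client works against the same endpoint by pointing its `base_url` at the server.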
By the end of the session, you will know how to deploy a quantized model on vLLM and apply these optimizations to your own production generative AI stack.