BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.devconf.info//devconf-us-2025//talk//FAACR8
BEGIN:VTIMEZONE
TZID:America/New_York
BEGIN:STANDARD
DTSTART:20001029T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10;UNTIL=20061029T060000Z
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:STANDARD
DTSTART:20071104T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000402T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4;UNTIL=20060402T070000Z
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
BEGIN:DAYLIGHT
DTSTART:20070311T020000
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-devconf-us-2025-FAACR8@pretalx.devconf.info
DTSTART;TZID=America/New_York:20250919T143000
DTEND;TZID=America/New_York:20250919T155000
DESCRIPTION:Large language models are powerful—but they’re also resourc
 e-intensive. Running them in production can be scary expensive without the
  right tooling and optimizations. That’s where vLLM and quantization com
 e in: together\, they offer a practical path to serving models at high spe
 ed and low cost\, even on modest hardware.\n\nIn this workshop\, you’ll 
 learn how to combine vLLM’s high-performance serving engine with quantiz
 ed models. Whether you’re deploying to GPU servers in the cloud or smaller
 -scale on-prem environments\, you’ll leave with the skills to drasticall
 y reduce inference latency and memory usage—without compromising output 
 accuracy.\n\nYou’ll learn how to:\n- Deploy quantized LLMs using vLLM’
 s OpenAI-compatible API\n- Choose the right quantization formats for your 
 hardware and use case\n- Use tools like llm-compressor to generate optimiz
 ed models\n- Benchmark and compare performance across different quantizati
 on settings\n- Tune vLLM configurations for throughput\, latency\, and mem
 ory efficiency\n\nBy the end of the session\, you will know how to deploy 
 your own quantized model on vLLM and apply these optimizations to your
  production Gen AI stack.
DTSTAMP:20260310T055809Z
LOCATION:107 (Capacity 20)
SUMMARY:Fast\, Cheap\, and Accurate: Optimizing LLM Inference with vLLM and
  Quantization - Legare Kerrison\, Taylor Smith
URL:https://pretalx.devconf.info/devconf-us-2025/talk/FAACR8/
END:VEVENT
END:VCALENDAR
