Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs - From Soup to Nuts
Modern LLM applications demand reliable, reproducible performance numbers that reflect real-world serving conditions. This tutorial-style presentation walks attendees through every step required to collect meaningful inference benchmarks on consumer or datacenter NVIDIA GPUs using an entirely open-source stack on Fedora.

We begin by enabling RPM Fusion and installing the akmod-nvidia driver, then validate hardware visibility with nvidia-smi before layering Podman 5.x and the NVIDIA Container Toolkit’s Container Device Interface (CDI) support on top to obtain rootless GPU access. We next demonstrate pulling the vLLM inference image, mounting a locally cached TinyLlama model downloaded via the Hugging Face CLI, and exposing an OpenAI-compatible HTTP endpoint. Finally, we introduce GuideLLM, an automated load-generation tool that sweeps request rates, captures latency distributions, throughput ceilings, and token-per-second statistics, and writes structured JSON for downstream analysis.

Live demos illustrate common pitfalls and give attendees troubleshooting checklists that transfer directly to any Red Hat-derived distribution. Participants will leave with a turnkey recipe they can adapt to larger models and multi-GPU nodes, and a clear understanding of how configuration choices cascade into benchmark accuracy. No prior container, CUDA, or benchmarking experience is assumed. Attendees also receive sample scripts and links for immediate hands-on replication; illustrative sketches of each step follow below.
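As a first taste of those sketches, the driver setup described above reduces to a few commands on Fedora. This is a minimal illustration rather than the talk's actual sample scripts; the repository URLs follow RPM Fusion's documented pattern, and a reboot is needed once akmods has built the kernel module.

```bash
# Enable the RPM Fusion free and nonfree repositories (URLs per RPM Fusion docs)
sudo dnf install \
  https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
  https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Install the NVIDIA kernel module (built by akmods) plus the CUDA userspace
# package that ships nvidia-smi, then reboot once the module has been built
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda

# After rebooting, confirm the driver sees the GPU
nvidia-smi
```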
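Rootless GPU access with Podman 5.x and CDI might then look like the following sketch. It assumes NVIDIA's package repository has already been configured so that nvidia-container-toolkit is installable; the CDI spec path and device name follow the NVIDIA Container Toolkit documentation.

```bash
# Install the toolkit that generates CDI specifications
# (assumes the NVIDIA container toolkit repository has been added beforehand)
sudo dnf install nvidia-container-toolkit

# Generate a CDI spec describing the GPUs on this host and list the devices
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list

# Smoke-test rootless GPU access; the CDI spec injects nvidia-smi and the
# driver libraries into the container, so a plain Fedora image is enough
podman run --rm --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  registry.fedoraproject.org/fedora:latest nvidia-smi
```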
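Downloading TinyLlama and serving it through the vLLM container could be sketched as follows. The Hugging Face repo id, local paths, and image tag are illustrative choices, not fixed values from the talk.

```bash
# Fetch TinyLlama weights into a local directory (repo id is illustrative)
pip install --upgrade "huggingface_hub[cli]"
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --local-dir ~/models/tinyllama

# Run the vLLM OpenAI-compatible server rootless, passing the GPU through
# via CDI and bind-mounting the downloaded weights into the container
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  -v ~/models/tinyllama:/models/tinyllama \
  -p 8000:8000 \
  docker.io/vllm/vllm-openai:latest \
  --model /models/tinyllama

# Quick check that the OpenAI-compatible endpoint is answering
curl http://localhost:8000/v1/models
```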
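Finally, a GuideLLM sweep against that endpoint might look like the sketch below. The flags mirror GuideLLM's documented quick-start example; the exact option names and the output filename are worth verifying against guidellm benchmark --help for the installed version.

```bash
# Install the load generator
pip install guidellm

# Sweep request rates against the local vLLM endpoint; synthetic requests of
# 256 prompt tokens / 128 output tokens, 60 seconds per rate, JSON results
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 60 \
  --data "prompt_tokens=256,output_tokens=128" \
  --output-path tinyllama-benchmark.json
```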