DevConf.US 2025

Ashish Kamra

Accomplished engineering leader with 15+ years of experience in AI, cloud-native platforms, and infrastructure. Proven track record of building and scaling high-performing teams and delivering significant performance improvements in enterprise AI products. Combines deep technical expertise in AI/ML with strategic vision to drive product innovation and business impact.


Job title

Senior Manager

Company or affiliation

Red Hat


Session

09-20
14:50
35min
Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs - from soup to nuts.
Ashish Kamra, David Gray

Modern LLM applications demand reliable, reproducible performance numbers that reflect real-world serving conditions. This tutorial-style presentation walks attendees through every step required to collect meaningful inference benchmarks on consumer or datacenter NVIDIA GPUs using an entirely open-source stack on Fedora.

Beginning with enabling RPM Fusion and installing the akmod-nvidia driver, we show how to validate hardware visibility with nvidia-smi, then layer Podman 5.x and the NVIDIA Container Toolkit’s Container Device Interface (CDI) to obtain rootless GPU access. We next demonstrate pulling the lightweight vLLM inference image, mounting a locally cached TinyLlama model downloaded via the Hugging Face CLI, and exposing an OpenAI-compatible HTTP endpoint. Finally, we introduce GuideLLM, an automated load-generation tool that sweeps request rates; captures latency buckets, throughput ceilings, and token-per-second statistics; and writes structured JSON for downstream analysis.

Live demos illustrate common pitfalls and provide troubleshooting checklists that transfer directly to any Red Hat-derived distribution. Participants will leave with a turnkey recipe they can adapt to larger models and multi-GPU nodes, and a clear understanding of how configuration choices cascade into benchmark accuracy. No prior container, CUDA, or benchmarking experience is assumed. Attendees also receive sample scripts and links for immediate hands-on replication.
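The pipeline the abstract describes might look like the following on Fedora. Package names, image names, and the model ID come from the abstract or the tools' public documentation; the GuideLLM flags in step 4 are illustrative assumptions and should be checked against the tool's current docs before use.

```shell
# Sketch of the setup pipeline (Fedora; requires an NVIDIA GPU and a reboot
# after the driver install). Verify each command against current docs.

# 1. Enable RPM Fusion (free + nonfree) and install the NVIDIA driver kmod.
sudo dnf install \
  "https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm" \
  "https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm"
sudo dnf install akmod-nvidia
# After the kernel module builds and a reboot, confirm the GPU is visible:
nvidia-smi

# 2. Generate a CDI spec so rootless Podman containers can access the GPU.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

# 3. Cache the model locally, then serve it with vLLM's OpenAI-compatible server.
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0
podman run --rm --device nvidia.com/gpu=all -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  docker.io/vllm/vllm-openai:latest \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

# 4. Sweep request rates with GuideLLM (flag names are illustrative).
guidellm benchmark --target http://localhost:8000 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --rate-type sweep --output-path results.json
```

Rootless CDI access (step 2) is what lets the vLLM container in step 3 run without root while still seeing the GPU; the bind-mounted Hugging Face cache avoids re-downloading the model inside the container.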
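The "structured JSON for downstream analysis" step can be sketched as a small post-processing script. The per-request record fields below (`latency_s`, `output_tokens`, `start_s`, `end_s`) are hypothetical placeholders, not GuideLLM's actual schema; adapt them to whatever your load generator emits.

```python
# Sketch: reduce per-request benchmark records to latency percentiles and
# token throughput. Field names are hypothetical; map them to your tool's JSON.
import json


def percentile(sorted_vals, p):
    """Nearest-rank percentile (p in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no samples")
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]


def summarize(records):
    """Compute p50/p95 latency and aggregate tokens/sec over the run's wall time."""
    latencies = sorted(r["latency_s"] for r in records)
    total_tokens = sum(r["output_tokens"] for r in records)
    wall_time = max(r["end_s"] for r in records) - min(r["start_s"] for r in records)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "tokens_per_s": total_tokens / wall_time,
        "requests": len(records),
    }


# Synthetic example: four requests over a 2-second window.
records = [
    {"start_s": 0.0, "end_s": 0.5, "latency_s": 0.5, "output_tokens": 64},
    {"start_s": 0.5, "end_s": 1.2, "latency_s": 0.7, "output_tokens": 64},
    {"start_s": 1.0, "end_s": 1.6, "latency_s": 0.6, "output_tokens": 64},
    {"start_s": 1.2, "end_s": 2.0, "latency_s": 0.8, "output_tokens": 64},
]
summary = summarize(records)
print(json.dumps(summary, indent=2))
```

Reporting percentiles rather than means is the usual choice for serving benchmarks, since tail latency is what users experience under load.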

Artificial Intelligence and Data Science
Ladd Room (Capacity 96)