2026-06-19, D105 (capacity 300)
GPU communication is a major bottleneck in LLM serving. Traditional PyTorch collectives copy data and synchronize unnecessarily, adding 100+ microseconds per operation. This talk introduces symmetric memory: zero-copy RDMA between GPUs using NCCL 2.29's one-sided APIs. We'll explore three primitives (put_signal, wait_signal, barrier) that enable direct GPU memory access with <10 µs latency, roughly 10x faster than traditional collectives. The implementation integrates with torch.compile through a registration API that lets any operator declare its symmetric memory requirements without modifying compiler code. Live demonstrations show a 35% throughput improvement in tensor-parallel LLM inference. We'll cover the architecture, memory registration, compiler integration challenges, and production deployment guidance. Attendees will learn when to use symmetric memory versus traditional collectives, how to integrate it into their applications, and PyTorch's roadmap for GPU communication.
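The abstract itself contains no code, but the one-sided pattern it describes maps naturally onto PyTorch's experimental torch.distributed._symmetric_memory module. The sketch below is an illustration under that assumption, not the speaker's implementation: the module path, the empty/rendezvous helpers, and the handle method signatures are drawn from PyTorch's experimental API and may differ from what the talk presents.

```python
# Minimal sketch (not the speaker's code): a one-sided "producer signals
# consumer" exchange between two ranks using PyTorch's experimental
# symmetric-memory module. Names and signatures are assumptions based on
# torch.distributed._symmetric_memory; verify against your PyTorch version.
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")  # launch with: torchrun --nproc-per-node=2 demo.py
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Allocate from the symmetric heap so peers can address this buffer directly,
# then rendezvous to exchange handles across the process group.
buf = symm_mem.empty(1024, dtype=torch.float32, device=f"cuda:{rank}")
hdl = symm_mem.rendezvous(buf, dist.group.WORLD)

if rank == 0:
    buf.fill_(42.0)    # producer writes payload into its own symmetric buffer
    hdl.put_signal(1)  # one-sided signal to rank 1: data is ready
else:
    hdl.wait_signal(0)  # consumer blocks until rank 0 signals
    # Read the producer's buffer directly; no collective is involved.
    remote = hdl.get_buffer(0, buf.shape, buf.dtype)
    print(f"rank {rank} read {remote[0].item()} from rank 0")

hdl.barrier()  # group-wide sync before teardown
dist.destroy_process_group()
```

In a tensor-parallel decode step, this pattern would let a rank consume a peer's shard as soon as the peer's signal lands, rather than paying a full collective's launch and synchronization latency.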
Rohit Singh Rathaur is an AI researcher and software engineer specializing in Deep Learning, Natural Language Processing (NLP), and Quantum Machine Learning. As of 2026, he is a Machine Learning Engineer at Red Hat, having previously worked at companies including io.net.