BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.devconf.info//devconf-cz-2026//talk//TM78BY
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-devconf-cz-2026-TM78BY@pretalx.devconf.info
DTSTART;TZID=CET:20260619T144500
DTEND;TZID=CET:20260619T152000
DESCRIPTION:GPU communication is the bottleneck in LLM serving. Traditional
  PyTorch collectives copy data and synchronize unnecessarily\, adding 100+
  microseconds per operation. This talk introduces symmetric memory: zero-c
 opy RDMA between GPUs using NCCL 2.29's one-sided APIs. We'll explore thre
 e primitives (put_signal\, wait_signal\, barrier) that enable direct GPU m
 emory access with <10 µs latency\, 10x faster than traditional collectives
 . The implementation integrates with torch.compile through a registration 
 API\, allowing any operator to declare symmetric memory requirements witho
 ut modifying compiler code. Live demonstrations show a 35% throughput improv
 ement in tensor-parallel LLM inference. We'll cover the architecture\, mem
 ory registration\, compiler integration challenges\, and production deploy
 ment guidance. Attendees will learn when to use symmetric memory versus tr
 aditional collectives\, how to integrate it into applications\, and
  PyTorch's GPU communication roadmap.
DTSTAMP:20260430T125007Z
LOCATION:D105 (capacity 300)
SUMMARY:Symmetric Memory in PyTorch: 10x Faster GPU Communication for AI - 
 Rohit Singh Rathaur
URL:https://pretalx.devconf.info/devconf-cz-2026/talk/TM78BY/
END:VEVENT
END:VCALENDAR
