Ayush Satyam
Associate Machine Learning Engineer at Red Hat. Active contributor to PyTorch and vLLM, and a maintainer of projects such as DiceDB.
Session
Mixture of Experts (MoE) architectures trade dense computation for conditional sparsity, activating only a subset of experts per input token. But this sparsity doesn't come for free: dynamic routing decisions, irregular memory access, and excessive kernel launches can quietly undermine performance. This talk covers optimization strategies for MoE inference using PyTorch 2.x's compilation stack alongside vLLM's serving framework. We will walk through profiler traces illustrating four key optimization areas: kernel fusion, FX graph optimizations, memory layout optimization, and dynamic shape specialization for variable batch sizes. We will then discuss how to extract insights from profiler data: mapping kernel timelines to specific fusion passes, distinguishing memory-bound from compute-bound expert execution, and validating that compiled MoE forward passes retain batch-size flexibility without guard-induced recompilation.
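To make the workflow concrete, here is a minimal, hypothetical sketch (not the talk's actual code): a toy top-2 MoE layer compiled with `torch.compile`, with the batch dimension marked dynamic via `torch._dynamo.mark_dynamic` so varying batch sizes do not trigger guard-induced recompilation, and a `torch.profiler` trace to inspect the resulting kernel timeline. The `ToyMoE` module and its dimensions are illustrative assumptions; `torch.compile`, `torch._dynamo.mark_dynamic`, and `torch.profiler` are standard PyTorch 2.x APIs.

```python
# Hypothetical sketch: toy MoE layer + torch.compile + dynamic batch + profiling.
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    """Toy top-2 MoE block: a router selects 2 of `num_experts` FFN experts per token."""

    def __init__(self, dim=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: [tokens, dim]
        logits = self.router(x)
        weights, idx = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Naive per-expert dispatch loop: exactly the kind of irregular,
        # launch-heavy code that profiling and fusion passes aim to improve.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out


moe = ToyMoE()
compiled = torch.compile(moe)

x = torch.randn(64, 256)
torch._dynamo.mark_dynamic(x, 0)  # treat the batch dim as dynamic: no recompile per size
compiled(x)  # warm-up call triggers compilation

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    for bs in (16, 48, 128):  # varying batch sizes should reuse the compiled artifact
        compiled(torch.randn(bs, 256))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The profiler table printed at the end is the starting point for the analysis described above: checking which kernels were fused, whether expert execution is memory-bound or compute-bound, and confirming that the three batch sizes did not each force a recompilation.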