Parshant Sharma
Associate Machine Learning Engineer, PyTorch Engineering Team, Red Hat
Parshant is an Associate Machine Learning Engineer at Red Hat and a Gold Medalist in his Master's in Computer Science and Engineering with an AI/ML specialization. He has authored four SCOPUS-indexed research papers in AI/ML. At Red Hat, he contributes to upstream open-source projects such as PyTorch. He also has hands-on experience with AI compilers, working with open-source compiler frameworks such as LLVM and MLIR to bridge ML workloads with systems-level optimization.
Session
Mixture of Experts (MoE) architectures trade dense computation for conditional sparsity, activating only a subset of experts per input token. But this sparsity doesn't come for free: dynamic routing decisions, irregular memory access, and excessive kernel launches can quietly undermine performance. This talk covers optimization strategies for MoE inference using PyTorch 2.x's compilation stack alongside vLLM's serving framework. We will walk through PyTorch Profiler traces to illustrate four key areas for optimization: kernel fusions, FX graph optimizations, memory layout optimization, and dynamic shape specialization for variable batch sizes. We will then discuss how to extract insights from profiler data: mapping kernel timelines to specific fusion passes, identifying memory-bound vs. compute-bound expert execution, and validating that compiled MoE forward passes maintain batch-size flexibility without guard-induced recompilation.
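A minimal sketch of the kind of workflow the abstract describes: compiling a toy MoE layer with dynamic shapes and profiling it across two batch sizes. The `ToyMoE` module, its dimensions, and the soft-routing simplification are illustrative assumptions, not code from the talk; real sparse top-k dispatch is exactly where the graph breaks and recompilations discussed above tend to appear.

```python
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    """Hypothetical toy MoE layer used only to illustrate profiling a compiled model."""

    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x):
        # Soft routing keeps the traced graph free of data-dependent control flow;
        # sparse top-k dispatch would introduce the irregularities the talk covers.
        weights = self.router(x).softmax(dim=-1)                  # (batch, num_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, dim, num_experts)
        return (outs * weights.unsqueeze(1)).sum(dim=-1)


model = ToyMoE()
compiled = torch.compile(model, dynamic=True)  # request batch-size-agnostic compilation

# Profile two different batch sizes; with dynamic=True the second call should reuse
# the compiled graph rather than trigger a guard-induced recompilation.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    for bs in (8, 32):
        compiled(torch.randn(bs, 256))

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The printed kernel summary is the starting point for the kind of analysis described above: attributing time to fused vs. unfused kernels and checking whether changing the batch size produced new compilation work.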