DevConf.CZ 2025

Sagar Jalindar Aivale

I am Sagar Aivale, currently a Senior Data Engineer with the UXE DATA Foundations team and previously a Senior Software Engineer on the Rules Acceleration and Automation team. With over seven years at Red Hat, I enjoy collaborating with colleagues across the company and learning about their work. Outside of work, I love spending time with friends and family.


Company or affiliation

Red Hat

Job title

Senior Data Engineer


Session

06-13
16:15
35min
PagedAttention: Revolutionizing Large Language Model Inference with Efficient Memory Management
Rahul Belokar, Sagar Jalindar Aivale

Large language models (LLMs) are pushing the boundaries of artificial intelligence, but their deployment is often hampered by memory bottlenecks arising from the ever-growing size of key-value (KV) caches. Traditional LLM serving systems struggle with inefficient memory utilization and limited scalability. Inspired by the concept of virtual memory paging in operating systems, PagedAttention offers a transformative solution. This novel technique partitions the KV cache into smaller, non-contiguous blocks, enabling dynamic allocation, efficient retrieval, and flexible reuse of memory. By decoupling the physical layout of the cache from the logical structure, PagedAttention minimizes memory fragmentation and overhead.
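As a rough illustration of the idea (a simplified sketch, not vLLM's actual implementation), the snippet below shows how a per-sequence block table can map logical KV-cache blocks to arbitrary free physical blocks, allocating memory only as tokens arrive; the block size and class names are hypothetical.

    # Illustrative block-table bookkeeping for a paged KV cache.
    BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical value)

    class BlockManager:
        def __init__(self, num_physical_blocks):
            self.free_blocks = list(range(num_physical_blocks))
            self.block_tables = {}  # seq_id -> list of physical block ids

        def append_token(self, seq_id, seq_len):
            """Ensure the sequence owns enough physical blocks for seq_len tokens."""
            table = self.block_tables.setdefault(seq_id, [])
            needed = -(-seq_len // BLOCK_SIZE)  # ceiling division
            while len(table) < needed:
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted; preempt or swap a sequence")
                table.append(self.free_blocks.pop())  # any free block will do: no contiguity needed

        def free(self, seq_id):
            """Return all physical blocks of a finished sequence to the free pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))

Because the physical blocks need not be contiguous, a finished request's blocks can be reused immediately by any other request, which is what keeps fragmentation low.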

This approach is integrated within vLLM, an open-source, high-performance LLM serving framework developed at UC Berkeley, and yields significant performance gains. Designed to address the memory bottlenecks of traditional LLM serving methods, vLLM leverages PagedAttention for efficient KV cache management, optimizing batch processing and eliminating redundant computation. As a result, PagedAttention achieves up to 30× higher throughput compared to traditional LLM serving methods like Hugging Face Transformers, Orca, and NVIDIA’s FasterTransformer. It also reduces KV cache waste to approximately 4%, ensuring near-optimal memory usage and enabling larger batch processing by minimizing memory overhead.
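For context, a minimal vLLM usage sketch looks like the following; the model name, prompts, and sampling values are illustrative choices, while the PagedAttention-based KV cache management happens entirely inside the library.

    from vllm import LLM, SamplingParams

    prompts = [
        "Explain PagedAttention in one sentence.",
        "Why do KV caches dominate LLM serving memory?",
    ]
    sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

    # vLLM batches these requests and pages their KV caches internally.
    llm = LLM(model="facebook/opt-125m")  # small model chosen here for a quick local test
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)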

Furthermore, vLLM seamlessly supports advanced sampling techniques, including beam search, without compromising latency. Challenges remain, such as the overhead of managing block lookup tables and potential latency increases in certain scenarios, but ongoing research is addressing these limitations. For example, optimized data structures and prefetching strategies can mitigate lookup overhead. Despite these challenges, PagedAttention represents a major advancement in LLM inference, unlocking scalable and efficient deployment even on resource-constrained hardware. This breakthrough paves the way for wider adoption of LLMs and empowers researchers to explore even larger and more complex models.
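One reason beam search stays cheap under this scheme is that candidate beams can share physical KV blocks. The conceptual sketch below (hypothetical names, not vLLM code) shows reference-counted sharing with copy-on-write: a block is duplicated only when a beam must write to a block that other beams still reference.

    class SharedBlock:
        def __init__(self, block_id):
            self.block_id = block_id
            self.ref_count = 1

    def fork_beam(parent_blocks):
        """A new beam reuses the parent's physical blocks; only the ref counts grow."""
        for block in parent_blocks:
            block.ref_count += 1
        return list(parent_blocks)

    def write_last_block(beam_blocks, allocate_block):
        """Copy-on-write: duplicate the last block only if another beam still shares it."""
        last = beam_blocks[-1]
        if last.ref_count > 1:
            last.ref_count -= 1
            beam_blocks[-1] = allocate_block()  # fresh private block for this beam
        return beam_blocks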

Artificial Intelligence and Data Science
D105 (capacity 300)