Erik Erlandson
Erik is the AI and Data Science lead at Red Hat's Emerging Technologies group, where he leads a team of data scientists and software engineers who evaluate new technologies at the intersection of data science, AI and cloud native development.
AI and Data Science Lead
Company or affiliation – Red Hat
Session
Large language models learn to predict human and machine text as sequences of “tokens.” But what are these tokens, and how are they used to represent text? The answers matter: tokenization forms the foundation of how every LLM generates its output, and it determines how output correctness trades off against compute performance.
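As a concrete illustration (not drawn from the talk itself), the short Python sketch below inspects a pretrained GPT-2 tokenizer via the Hugging Face transformers library; the library and model choice are assumptions made purely for demonstration, showing how a sentence becomes subword tokens and the integer IDs a model actually consumes.

```python
# Illustrative only, not code from the talk: inspecting how a pretrained
# GPT-2 tokenizer splits text into subword tokens, using the Hugging Face
# "transformers" library. Assumes the library is installed and the
# tokenizer files can be downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization underlies every large language model."
tokens = tokenizer.tokenize(text)  # subword strings, e.g. ['Token', 'ization', ...]
ids = tokenizer.encode(text)       # integer IDs the model actually consumes

print(tokens)
print(ids)
print(tokenizer.decode(ids))       # round-trips back to the original text
```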
In this talk, Erik Erlandson will explore a variety of algorithms used to tokenize text before it is processed by these models, focusing on their trade-offs and their impact on model performance. He’ll compare word-based, subword-based, and character-level tokenization, including widespread approaches such as Byte Pair Encoding and WordPiece.
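For a taste of one of those approaches, here is a minimal sketch of the core Byte Pair Encoding training loop on a toy corpus; the corpus, helper names, and merge count are illustrative assumptions, and production tokenizers add byte-level handling, pre-tokenization, and special tokens on top of this idea.

```python
# A minimal, illustrative sketch of Byte Pair Encoding (BPE) training on a
# toy corpus. Not code from the talk; the corpus and merge count are
# hypothetical, chosen only to show the merge loop.
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged


# Toy corpus: word -> frequency, with each word split into characters plus an
# end-of-word marker so merges cannot cross word boundaries.
corpus = {"lower": 2, "lowest": 1, "newer": 3, "wider": 1}
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

num_merges = 8
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

Each merge adds one new symbol to the vocabulary, which is how BPE trades a larger vocabulary for shorter token sequences.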
Attendees will gain an understanding of how LLMs depend on tokenization and how the choice of tokenization strategy shapes model performance trade-offs.