Abstract: Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, leading to high latency, which can be prohibitive in certain tasks. One way to speed up sampling is speculative decoding: use a small model to sample a draft (a block of tokens), and then score all tokens in the draft with the large language model in parallel to reduce latency. A subset of the tokens in the draft is accepted (and the rest rejected) based on a statistical verification algorithm that guarantees the final output follows the distribution of the large model.
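For reference, here is a minimal sketch of the standard token-level verification step described above (the function and variable names are illustrative and not from the talk): the large model scores every draft position in parallel, each draft token is accepted with probability min(1, p(x)/q(x)), and the first rejected position is replaced by a sample from the residual distribution proportional to max(p - q, 0).

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng=None):
    """Standard token-level speculative decoding verification (sketch).

    draft_tokens: token ids sampled autoregressively from the small model.
    q_probs[i]:   small-model distribution used to sample draft_tokens[i].
    p_probs[i]:   large-model distribution at position i (computed in parallel);
                  p_probs has one extra entry for a bonus token if all are accepted.
    Returns the accepted prefix plus one token drawn from the large model,
    so the output follows the large model's distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            continue
        # On rejection, resample from the residual distribution
        # proportional to max(p - q, 0), then stop.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        accepted.append(int(rng.choice(len(residual), p=residual)))
        return accepted
    # All draft tokens accepted: draw a bonus token from the large model.
    accepted.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return accepted
```

The accepted prefix together with the corrected (or bonus) token is provably distributed according to the large model, which is the guarantee referenced in the abstract.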
In this talk, we provide a principled understanding of speculative decoding through the lens of optimal transport. This new formulation enables us to improve upon speculative decoding in two ways. First, we propose an optimal block-level draft verification algorithm that provides additional wall-clock speedup without incurring additional computation cost. Next, when extra parallel computations are available, we show that we can further improve the speedup by drawing multiple drafts from the small model. We provide theoretical guarantees on the proposed algorithms and characterize the expected speedup. We further empirically demonstrate the practicality of the new algorithms on standard datasets.
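As a concrete instance of the optimal-transport view, consider the standard single-token case (a sketch of the well-known connection, not the block-level formulation developed in the talk): verification is a coupling of the draft distribution q and the target distribution p, and the probability of accepting the draft token is maximized by the maximal coupling,
\[
\max_{\pi \in \Pi(q, p)} \Pr_{(X, Y) \sim \pi}[X = Y] \;=\; \sum_{x} \min\big(p(x), q(x)\big) \;=\; 1 - d_{\mathrm{TV}}(p, q),
\]
where \(\Pi(q, p)\) denotes the set of joint distributions with marginals q and p. The token-level accept/reject rule attains this maximum, i.e., it solves the optimal transport problem with cost \(\mathbf{1}\{X \neq Y\}\); the talk generalizes this objective to whole blocks and to multiple drafts.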
Bio: Ziteng Sun is a research scientist at Google Research working on developing efficient and responsible machine learning algorithms. His work includes topics such as privacy-preserving machine learning and efficient methods for large language models. His research interests lie broadly in machine learning, algorithmic statistics, and information theory. He obtained his PhD from Cornell University, where he was advised by Professor Jayadev Acharya.