Bio: Tianxiao Shen is a postdoctoral scholar at the University of Washington, working with Yejin Choi and Zaid Harchaoui. Her research interests lie in natural language processing and machine learning, in particular in developing models and algorithms for efficient, accurate, diverse, flexible, and controllable text generation. She received her PhD from MIT, advised by Regina Barzilay and Tommi Jaakkola. Before that, she completed her undergraduate studies at Tsinghua University.
Abstract: We propose GumbelSpec sampling, a novel algorithm that leverages smaller language models to accelerate inference in large language models without changing their output distribution. Central to our approach is the application of the Gumbel-Softmax technique, which converts the stochastic decoding process into a deterministic one by incorporating independently sampled Gumbel noise. Using the same set of Gumbel noise, we perform beam search on the smaller model to generate multiple candidate short continuations, and then use tree-based attention to verify them in parallel with the larger model. GumbelSpec sampling significantly improves upon previous rejection-sampling-based speculative decoding methods, increasing the token acceptance rate by 1.7x-2.2x and yielding an additional speedup of 1.2x-1.5x, for a total speedup of 1.5x-2.6x over standard autoregressive decoding.
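To make the deterministic-decoding idea concrete, below is a minimal sketch of Gumbel-noise reparameterized sampling with noise shared between a draft and a target model. It uses toy logits in place of real language models, and all names (sample_gumbel, gumbel_argmax, draft_logits, target_logits) and the simple agreement-based acceptance check are illustrative assumptions, not the talk's actual implementation.

```python
# A minimal sketch, assuming numpy only and toy logits standing in for the
# draft (small) and target (large) models. Names are illustrative.
import numpy as np

def sample_gumbel(shape, rng):
    """Draw standard Gumbel noise: G = -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.uniform(low=1e-10, high=1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_argmax(logits, gumbel_noise):
    """Gumbel-max reparameterization: argmax(logits + G) is an exact sample
    from softmax(logits), yet is a deterministic function of the noise."""
    return int(np.argmax(logits + gumbel_noise))

rng = np.random.default_rng(0)
vocab_size = 8
noise = sample_gumbel((vocab_size,), rng)  # one shared noise vector per position

draft_logits = rng.normal(size=vocab_size)                          # small model (toy)
target_logits = draft_logits + 0.1 * rng.normal(size=vocab_size)    # similar large model (toy)

draft_token = gumbel_argmax(draft_logits, noise)
target_token = gumbel_argmax(target_logits, noise)

# With shared noise, the draft's proposal matches the target's sample exactly
# when the two argmaxes coincide, so agreement between similar models
# translates directly into a high token acceptance rate.
print("draft:", draft_token, "target:", target_token,
      "accepted:", draft_token == target_token)
```

In a full speculative-decoding loop, the same idea would be applied position by position: the shared noise makes both models' "samples" deterministic, candidate continuations from the draft model can be scored in parallel by the target model, and a candidate token is kept whenever the target's deterministic choice agrees with it.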