BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//IFDS - ECPv6.0.1.1//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-ORIGINAL-URL:https://ifds.info
X-WR-CALDESC:Events for IFDS
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20240310T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20241103T020000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20240223T133000
DTEND;TZID=America/Los_Angeles:20240223T143000
DTSTAMP:20260514T210523Z
CREATED:20240318T212752Z
LAST-MODIFIED:20240318T212752Z
UID:2886-1708695000-1708698600@ifds.info
SUMMARY:GumbelSpec Sampling for Accelerating LLM Inference
DESCRIPTION:Bio: Tianxiao Shen is a postdoctoral scholar at the University of Washington\, working with Yejin Choi and Zaid Harchaoui. Her research interests lie in natural language processing and machine learning\, in particular developing models and algorithms for efficient\, accurate\, diverse\, flexible\, and controllable text generation. She received her PhD from MIT\, advised by Regina Barzilay and Tommi Jaakkola. Before that\, she did her undergrad at Tsinghua University.\n\nAbstract: We propose GumbelSpec sampling\, a novel algorithm that leverages smaller language models to accelerate inference of large language models without changing their output distribution. Central to our approach is the application of the Gumbel-Softmax technique to convert the stochastic decoding process into a deterministic process by integrating independently sampled Gumbel noise. Using the same set of Gumbel noise samples\, we perform beam search on the smaller model to generate multiple candidate short continuations\, and then utilize tree-based attention to efficiently verify them in parallel using the larger model. GumbelSpec sampling significantly improves upon previous rejection-sampling-based speculative decoding methods by increasing the token acceptance rate by 1.7x-2.2x and achieving an additional speedup of 1.2x-1.5x. This results in a total speedup of 1.5x-2.6x compared to traditional autoregressive decoding.
URL:https://ifds.info/event/gumbelspec-sampling-for-accelerating-llm-inference/
LOCATION:CSE (Allen) 403
CATEGORIES:MLOpt@UWash
END:VEVENT
END:VCALENDAR