[ICML 2023] Fast inference from transformers via speculative decoding

news2025/1/4 16:29:51

Introduction
Speculative Decoding
- Standardized Sampling
- Speculative Sampling
Analysis
- Number of Generated Tokens
- Calculating $\alpha$
- Walltime Improvement
- Number of Arithmetic Operations
- Choosing $\gamma$
Experiments
References

Introduction

To speed up inference for autoregressive LLMs, the authors propose speculative decoding: a small language model accelerates LLM inference so that a single decoding step can produce several tokens, without changing the LLM's output distribution. Two observations motivate the method: (1) inference from large models is often not bottlenecked on arithmetic operations, but rather on memory bandwidth and communication; (2) computing the logits of a short continuation of $K$ tokens in parallel has a very similar latency to that of sampling a single token.
Speculative decoding speeds up T5-XXL (11B) inference by 2X-3X.

Speculative Decoding

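Notation. $M_p$ is the target model whose inference we want to accelerate, and $p(x)$ denotes the distribution $p(x_t|x_{<t})$ that $M_p$ outputs for the prefix $x_{<t}$. $M_q$ is the approximation model, and $q(x)$ denotes the distribution $q(x_t|x_{<t})$ that $M_q$ outputs for the prefix $x_{<t}$.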

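Sampling methods such as argmax, top-k, and nucleus can all be recast as standard sampling from an adjusted probability distribution. For example, argmax sampling amounts to zeroing out every non-maximal entry of the model's output distribution and then sampling from the renormalized result. The authors therefore only consider standard sampling from a probability distribution below, and the results carry over to the other sampling methods.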
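
The reduction of argmax, top-k, and nucleus sampling to an adjusted distribution can be sketched as follows. This is a minimal illustration in plain Python; the function names (`argmax_adjust`, `top_k_adjust`, `nucleus_adjust`) are hypothetical and do not come from the paper.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def argmax_adjust(probs):
    """Keep only the most likely token (ties go to the lowest index)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def top_k_adjust(probs, k):
    """Zero out everything except the k most likely tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return normalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def nucleus_adjust(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return normalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

probs = [0.5, 0.3, 0.15, 0.05]
print(argmax_adjust(probs))
print(top_k_adjust(probs, 2))
print(nucleus_adjust(probs, 0.9))
```

Once both $p(x)$ and $q(x)$ have been adjusted this way, the rest of the method only ever needs standard sampling.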

Speculative Sampling. First sample $x\sim q(x)$. If $q(x)\leq p(x)$, accept the sample; otherwise reject it with probability $1-\frac{p(x)}{q(x)}$ and resample from the adjusted distribution $p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$. It can be shown that the $x$ obtained this way satisfies $x\sim p(x)$ (see "Correctness of Speculative Sampling").
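
The single-token rule can be sketched in a few lines of Python. This is an illustrative sketch over explicit probability lists, not the authors' implementation; `residual_distribution` and `speculative_sample` are hypothetical names, and the toy distributions `p` and `q` are made up for the check at the end.

```python
import random

def residual_distribution(p, q):
    """p'(x) = norm(max(0, p(x) - q(x)))."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def speculative_sample(p, q, rng):
    """Draw one token distributed as p, using a proposal drawn from q."""
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the adjusted distribution p'.
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]

# Empirical check on toy distributions: frequencies should approach p, not q.
rng = random.Random(0)
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
trials = 50_000
counts = [0, 0, 0]
for _ in range(trials):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / trials for c in counts]
print(freqs)
```

A rejection can only happen when $q(x) > p(x)$ for the proposed token, so the residual mass is strictly positive whenever `residual_distribution` is called.
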
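Speculative Decoding Step. $M_q$ first samples $\gamma$ tokens autoregressively. These are fed, together with the prompt, into $M_p$, which computes in parallel the distributions $p(x)$ at $\gamma+1$ positions. If all $\gamma$ tokens are accepted, one more token $t$ is sampled from $p_{\gamma+1}(x)$. If any token is rejected, find the first rejected token among the $\gamma$ drafts (say the $(n+1)$-th), resample token $t$ at that position from the adjusted distribution $p'(x)$, and accept the first $n$ tokens plus $t$. A single step therefore decodes between 1 and $\gamma+1$ tokens.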
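
One decoding step can be sketched as below. This is a toy sketch under stated assumptions: a "model" is any Python function mapping a token list to a probability list, the names `speculative_decoding_step`, `toy_target`, and `toy_draft` are hypothetical, and the target model is called once per position in a loop, whereas a real transformer would score all $\gamma+1$ positions in one parallel forward pass.

```python
import random

def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist)[0]

def speculative_decoding_step(target, draft, prefix, gamma, rng):
    """Return the tokens accepted in one step (between 1 and gamma + 1)."""
    # 1. Draft gamma tokens autoregressively with M_q.
    guesses, q_dists = [], []
    for _ in range(gamma):
        q = draft(prefix + guesses)
        q_dists.append(q)
        guesses.append(sample(q, rng))
    # 2. Get p at all gamma + 1 positions (one parallel pass in a real model).
    p_dists = [target(prefix + guesses[:i]) for i in range(gamma + 1)]
    # 3. Count how many leading guesses are accepted.
    n = 0
    while n < gamma:
        x = guesses[n]
        if rng.random() >= min(1.0, p_dists[n][x] / q_dists[n][x]):
            break
        n += 1
    # 4. Emit one extra token: from p' after a rejection, else from p_{gamma+1}.
    if n == gamma:
        final = p_dists[gamma]
    else:
        diff = [max(0.0, a - b) for a, b in zip(p_dists[n], q_dists[n])]
        total = sum(diff)
        final = [d / total for d in diff]
    return guesses[:n] + [sample(final, rng)]

def toy_target(prefix):
    """Toy M_p over a 3-token vocabulary: prefers (last token + 1) mod 3."""
    dist = [0.2, 0.2, 0.2]
    dist[(prefix[-1] + 1) % 3] = 0.6
    return dist

def toy_draft(prefix):
    """Toy M_q: same preference as the target, but less confident."""
    dist = [0.25, 0.25, 0.25]
    dist[(prefix[-1] + 1) % 3] = 0.5
    return dist

rng = random.Random(0)
print(speculative_decoding_step(toy_target, toy_draft, [0], 4, rng))
```

Repeating the step and appending its output to the prefix yields the full decoding loop.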

Correctness of Speculative Sampling
(Figure: the paper's proof that speculative sampling yields $x\sim p(x)$.)
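
The claim $x\sim p(x)$ can be checked with a short calculation. The following is a reconstruction of the standard argument rather than a copy of the paper's figure; it writes $\beta=\sum_x\min(p(x),q(x))$ for the overall acceptance probability and uses the identity $\max(0, p(x)-q(x)) = p(x)-\min(p(x),q(x))$.

```latex
\begin{align*}
P(x = x') &= P(\text{accepted},\, x = x') + P(\text{rejected},\, x = x') \\
P(\text{accepted},\, x = x') &= q(x') \min\!\left(1, \frac{p(x')}{q(x')}\right) = \min(p(x'), q(x')) \\
P(\text{rejected}) &= 1 - \sum_{x} \min(p(x), q(x)) = 1 - \beta \\
p'(x') &= \frac{p(x') - \min(p(x'), q(x'))}{\sum_{x} \bigl(p(x) - \min(p(x), q(x))\bigr)}
        = \frac{p(x') - \min(p(x'), q(x'))}{1 - \beta} \\
P(\text{rejected},\, x = x') &= (1 - \beta)\, p'(x') = p(x') - \min(p(x'), q(x')) \\
P(x = x') &= \min(p(x'), q(x')) + p(x') - \min(p(x'), q(x')) = p(x')
\end{align*}
```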

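First define the acceptance rate $\beta$: the probability that a token sampled from $q(x)$ is accepted by speculative sampling, given the prefix $x_{<t}$.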
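Assume the $\beta$ values are i.i.d. and let $\alpha=E(\beta)$. The number of tokens generated in one step is then a capped geometric variable with success probability $1-\alpha$ and cap $\gamma+1$, so the expected number of tokens produced by a single run of Algorithm 1 is

$E(\#\text{generated tokens}) = \frac{1-\alpha^{\gamma+1}}{1-\alpha}$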
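
The closed form $\frac{1-\alpha^{\gamma+1}}{1-\alpha}$ for the expected number of tokens per step can be sanity-checked with a small Monte Carlo simulation. This is a sketch under the i.i.d. assumption, where each drafted token is accepted independently with probability $\alpha$; the function names are hypothetical.

```python
import random

def expected_tokens(alpha, gamma):
    """Mean of a capped geometric variable: success prob 1 - alpha, cap gamma + 1."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def simulate_tokens(alpha, gamma, trials, rng):
    """Average tokens per step when each draft is accepted i.i.d. with prob alpha."""
    total = 0
    for _ in range(trials):
        accepted = 0
        while accepted < gamma and rng.random() < alpha:
            accepted += 1
        # Every step also emits one extra token (resampled or from p_{gamma+1}).
        total += accepted + 1
    return total / trials

rng = random.Random(0)
print(expected_tokens(0.8, 4))
print(simulate_tokens(0.8, 4, 100_000, rng))
```

The two printed values should agree up to sampling noise.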

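Is a summation sign missing inside the expectation in the last expression of Corollary 3.6? The expression $E(\min(p, q))$ presumably stands for $E\left(\sum_x \min(p(x), q(x))\right)$.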
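
Reading the expectation with the sum over the vocabulary included, $\alpha$ can be estimated as sketched below. This assumes $\beta=\sum_x\min(p(x),q(x))$ for each prefix, which equals one minus the total variation distance $\frac{1}{2}\sum_x|p(x)-q(x)|$; the function names and the toy distribution pairs are made up for illustration.

```python
def acceptance_rate(p, q):
    """beta for one prefix: sum over the vocabulary of min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    """Half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def estimate_alpha(pairs):
    """alpha = E(beta), estimated as the mean of beta over sampled prefixes."""
    betas = [acceptance_rate(p, q) for p, q in pairs]
    return sum(betas) / len(betas)

# Each pair is (p, q) at one hypothetical prefix.
pairs = [
    ([0.6, 0.3, 0.1], [0.2, 0.5, 0.3]),
    ([0.1, 0.8, 0.1], [0.1, 0.7, 0.2]),
]
for p, q in pairs:
    print(acceptance_rate(p, q), 1 - total_variation(p, q))
print(estimate_alpha(pairs))
```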

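The authors assume there are enough compute resources to support the increased concurrency, i.e., having the LLM verify $\gamma+1$ tokens in parallel does not increase walltime. The only extra cost of speculative decoding is then running the approximation model $M_q$.
Cost coefficient $c$: the ratio between the time for a single run of $M_q$ and the time for a single run of $M_p$. The paper reports: "In our experiments where $M_q$ is typically a couple of orders of magnitude smaller than $M_p$, $c$ was always less than 0.05 and often negligibly close to 0."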
Expected improvement factor in total walltime (Theorem 3.8): $\frac{1-\alpha^{\gamma+1}}{(1-\alpha)(\gamma c+1)}$. If $c$ is negligible, the expected improvement factor can reach at most $\frac{1}{1-\alpha}$ (as $\gamma\rightarrow\infty$).
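
The walltime improvement factor $\frac{1-\alpha^{\gamma+1}}{(1-\alpha)(\gamma c+1)}$ follows from dividing the expected tokens per step by the relative cost of a step ($\gamma$ runs of $M_q$ at cost $c$ each, plus one run of $M_p$). A minimal sketch, with a hypothetical function name and illustrative parameter values:

```python
def improvement_factor(alpha, gamma, c):
    """Expected walltime speedup over plain autoregressive decoding with M_p."""
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_step = gamma * c + 1  # in units of one M_p run
    return tokens_per_step / cost_per_step

alpha = 0.8
for gamma in (1, 2, 4, 8):
    print(gamma, improvement_factor(alpha, gamma, c=0.05))

# With c = 0 the factor approaches 1 / (1 - alpha) as gamma grows.
print(improvement_factor(alpha, 1000, c=0.0), 1 / (1 - alpha))
```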

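Given $c$ and $\alpha$, and assuming enough compute resources, the optimal $\gamma$ is the one that maximizes the walltime improvement factor (i.e., Theorem 3.8). Since $\gamma$ is an integer, the optimum is easy to find numerically.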
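
A brute-force search over integer $\gamma$ is enough in practice. The sketch below assumes the improvement factor $\frac{1-\alpha^{\gamma+1}}{(1-\alpha)(\gamma c+1)}$; the function names and the search bound `max_gamma` are arbitrary choices for illustration, and $\gamma=0$ is included to represent "no speculation".

```python
def improvement_factor(alpha, gamma, c):
    """Expected walltime speedup for given alpha, gamma, and cost coefficient c."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

def optimal_gamma(alpha, c, max_gamma=100):
    """Integer gamma in [0, max_gamma] that maximizes the improvement factor."""
    return max(range(max_gamma + 1),
               key=lambda g: improvement_factor(alpha, g, c))

for alpha in (0.5, 0.7, 0.9):
    g = optimal_gamma(alpha, c=0.05)
    print(alpha, g, improvement_factor(alpha, g, 0.05))
```

When the draft model is too slow relative to how often it is accepted, the search returns 0, meaning speculation does not pay off for those values.
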
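There is a trade-off between inference speed and the total number of arithmetic operations: drafting more tokens per step can raise speed, but operations spent on rejected tokens are wasted (the comparison assumes $\hat c = 0$, where $\hat c$ is the ratio of arithmetic operations per token of $M_q$ to that of $M_p$).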
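
The trade-off can be made concrete with a small calculation. The operation count used here is an assumption spelled out for illustration: each step runs $M_q$ for $\gamma$ tokens (relative cost $\hat c$ each) and $M_p$ over $\gamma+1$ positions, and produces $\frac{1-\alpha^{\gamma+1}}{1-\alpha}$ tokens on average. The function names are hypothetical.

```python
def operations_factor(alpha, gamma, c_hat):
    """Expected increase in total arithmetic operations versus plain decoding."""
    ops_per_step = gamma * c_hat + gamma + 1  # in units of one M_p token
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return ops_per_step / tokens_per_step

def speed_factor(alpha, gamma, c):
    """Expected walltime speedup versus plain decoding."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# Larger gamma buys speed at the price of more total operations (c = c_hat = 0).
alpha = 0.6
for gamma in (2, 4, 6):
    print(gamma,
          round(operations_factor(alpha, gamma, 0.0), 2),
          round(speed_factor(alpha, gamma, 0.0), 2))
```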

Empirical Walltime Improvement. $M_p$ : T5-XXL (11B). $M_q$ : T5-large (800M), T5-base (250M), and T5-small (77M)
Theoretical Predictions vs. Empirical Runtimes
Empirical $\alpha$ Values. For every model, $\alpha$ under standard sampling is lower than $\alpha$ under argmax sampling (the sharper the adjusted distribution, the higher the $\alpha$ values).

Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Fast inference from transformers via speculative decoding.” International Conference on Machine Learning. PMLR, 2023.
