Paper-Reading Series Index
Table of Contents
- Paper-Reading Series Index
- Meaning of the Paper Title
- Abstract
- 1 Introduction
- 2 Prompt Tuning
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
Paper link
Meaning of the Paper Title
"The Power of Scale for Parameter-Efficient Prompt Tuning": the role that model scale plays in making parameter-efficient prompt tuning work.
Abstract
In this work, we explore “prompt tuning,” a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method “closes the gap” and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed “prefix tuning” of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient “prompt ensembling.”
1 Introduction
With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or “fine-tuning”), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).
More recently, Brown et al. (2020) showed that prompt design (or “priming”) is surprisingly effective at modulating a frozen GPT-3 model’s behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to “freezing” pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks.
Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model’s input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B few-shot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3), despite using 16 times more parameters.
Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning.
Li and Liang (2021) propose “prefix tuning” and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classification tasks.
In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This “soft prompt” is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2).
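The mechanics of this can be sketched numerically: the frozen embedding table maps the input text to embeddings as usual, and the only trainable parameters are a small matrix of k prompt vectors that gets concatenated in front of every example before it enters the frozen transformer. The function name and dimensions below are illustrative, not the paper's actual code.

```python
import numpy as np

def prepend_soft_prompt(input_embeds, soft_prompt):
    """Prepend k trainable prompt vectors to a batch of input embeddings.

    input_embeds: (batch, seq_len, d_model), from the frozen embedding table
    soft_prompt:  (k, d_model), the only parameters updated by backprop
    returns:      (batch, k + seq_len, d_model), fed to the frozen model
    """
    batch = input_embeds.shape[0]
    # Broadcast the same learned prompt across every example in the batch
    prompt = np.broadcast_to(soft_prompt, (batch,) + soft_prompt.shape)
    return np.concatenate([prompt, input_embeds], axis=1)

# Illustrative sizes: k = 5 prompt tokens, a T5-XXL-like d_model of 4096
rng = np.random.default_rng(0)
soft_prompt = rng.normal(size=(5, 4096))       # trained end-to-end
input_embeds = rng.normal(size=(2, 10, 4096))  # frozen embedding lookup
combined = prepend_soft_prompt(input_embeds, soft_prompt)
print(combined.shape)  # (2, 15, 4096)
```

Because gradients flow only into `soft_prompt`, each downstream task is captured entirely by its own small prompt matrix while the pre-trained weights stay shared.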
Figure 1: Standard model tuning of T5 achieves strong performance, but requires storing a separate copy of the model for each end task. Our prompt tuning of T5 matches the quality of model tuning while enabling the reuse of a single frozen model for all tasks. Our approach significantly outperforms few-shot prompt design using GPT-3. We show the mean and standard deviation across 3 runs for the tuning methods.
Figure 2: Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model. For a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By comparison, our tuned prompts require only 20,480 parameters per task (a reduction of more than five orders of magnitude), assuming a prompt length of 5 tokens.
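The parameter counts in the caption can be checked directly: a soft prompt stores one embedding vector per prompt token, so with 5 tokens and T5-XXL's embedding dimension of 4096 the per-task parameter count is 5 × 4096 = 20,480, versus roughly 11 billion for a full tuned copy of the model.

```python
prompt_length = 5                # tokens in the soft prompt
d_model = 4096                   # T5-XXL embedding dimension
prompt_params = prompt_length * d_model
model_params = 11_000_000_000    # one tuned T5-XXL copy per task

print(prompt_params)                  # 20480
print(model_params // prompt_params)  # 537109, i.e. more than 5 orders of magnitude
```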
While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale.
We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the “generalist” parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that “prompt ensembling”, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are:
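The efficiency claim for prompt ensembling follows from the serving setup: since all ensemble members share the same frozen model, one example can be replicated across N learned prompts and scored in a single batch, rather than running N separately tuned models. The sketch below illustrates this with a stand-in classifier in place of the frozen transformer; all names are hypothetical.

```python
from collections import Counter

import numpy as np

def ensemble_predict(frozen_model, prompts, input_embeds):
    """Prompt ensembling: score one example under several learned prompts
    in a single batch through the shared frozen model, then majority-vote.

    frozen_model: callable (batch, seq, d) -> (batch,) class predictions
    prompts:      list of (k, d) soft prompts, e.g. trained from different seeds
    input_embeds: (seq, d) embeddings for a single example
    """
    batch = np.stack([np.concatenate([p, input_embeds], axis=0) for p in prompts])
    preds = frozen_model(batch)  # one forward pass covers the whole ensemble
    return Counter(preds.tolist()).most_common(1)[0][0]

# Stand-in "frozen model": classify by the sign of the mean activation
toy_model = lambda x: (x.mean(axis=(1, 2)) > 0).astype(int)
rng = np.random.default_rng(1)
prompts = [rng.normal(size=(5, 8)) for _ in range(3)]
example = rng.normal(size=(4, 8))
label = ensemble_predict(toy_model, prompts, example)
```

The storage cost of the ensemble is N small prompt matrices instead of N full model copies, which is what makes it cheaper than classic model ensembling.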
1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models.
2. Ablating many design choices, and showing quality and robustness improve with scale.
3. Showing prompt tuning outperforms model tuning on domain shift problems.
4. Proposing “prompt ensembling” and showing its effectiveness.