InstructionGPT

之前是写在[LLM：提示学习Prompt Learning]里的，抽出来单独讲一下。

基本原理

InstructGPT/ChatGPT都是采用了GPT-3的网络结构，通过指示学习构建训练样本来训练一个反应预测内容效果的奖励模型（RM），最后通过这个奖励模型的打分来指导强化学习模型的训练。

模型结构

学习步骤

Step1：先采样一些demonstration数据，其包括prompt和labeled answer。基于这些标注的数据，对GPT-3进行fine-tuning，得到SFT（Supervised Fine-tuning）；即使用采集的新数据，按照GPT-3的训练方式对GPT-3进行微调。（此时的SFT模型在遵循指令/对话方面已经优于 GPT-3，但不一定符合人类偏好。）

SFT数据一部分来自使用OpenAI的PlayGround的用户，另一部分来自OpenAI雇佣的40名标注工（labeler）。并且他们对labeler进行了培训。在这个数据集中，标注工的工作是根据内容自己编写指示，并且要求编写的指示满足下面三点：简单任务：labeler给出任意一个简单的任务，同时要确保任务的多样性；Few-shot任务：labeler给出一个指示，以及该指示的多个查询-响应对；用户相关的：从接口中获取用例，然后让labeler根据这些用例编写指示。

Step2：Fine-tuning完之后，再给一个prompt让SFT模型生成出若干结果（可以通过beam search等方法），通过人工为其排序，可以得到标注的排序pair；基于标注的排序结果（来自于Human Feedback），训练一个Reward Model；具体为：对多个排序结果，两两组合，形成多个训练数据对。RM模型接受一个输入，给出评价回答质量的分数。这样，对于一对训练数据，调节参数使得高质量回答的打分比低质量的打分要高。

奖励模型的损失函数：最大化labeler更喜欢的响应和不喜欢的响应之间的差值。

Step3：继续用生成出来的结果训练SFT，并通过强化学习的PPO方法，最大化SFT生成出排序靠前的answer。即使用第二步训练得到的 reward model 和 PPO 强化学习算法，对第一步的模型进行 fine-tune.

[ChatGPT 背后的“功臣”——RLHF 技术详解]

PPO-ptx模型

PPO-ptx模型训练目标：

随着模型的更新，强化学习模型产生的数据和训练奖励模型的数据的差异会越来越大。作者的解决方案是在损失函数中加入KL惩罚项来确保PPO模型的输出和SFT的输出差距不会很大。
只用PPO模型进行训练的话，会导致模型在通用NLP任务上性能的大幅下降，作者的解决方案是在训练目标中加入了通用的语言模型目标，这个变量在论文中被叫做PPO-ptx。

初始化时，期望在训练πϕ时能够最大化reward的得分。

为什么不用 Reward-Model 的数据直接fine-tune而用 RL

用 supervised finetune（sft）很容易做到返回“我不知道”，但是很难让模型不去编内容。或者说，在标注的时候，对于一个模型不会的问题，标注员应该要标注为“不知道”，而不是给一个回答，不然 sft 的时候相当于训练模型怎么编答案。
我们最好划出一个边界，哪些是模型不知道的，哪些是知道但是不确定的，哪些是知道的，并通过训练加强这些不同分类的分界，让模型能稳定回答它明确知道的东西，让模型不要回答它不知道的/错误的内容，这样才能保持或提升模型的 factual truthfulness；
如果我们可以有一个 reward model，他根据这个边界返回 reward 就好了，这样的话模型的训练就能集中于我们上面描述的目标。这也是为啥他们要用 RL。但是实际上 InstructGPT 里的 loss 也并不是以找到这个边界为指标，而是优化正例和负例之间的差别。这使得当前的 reward model 没有做到预期的效果，还有优化的空间。

[ChatGPT 为什么不用 Reward-Model 的数据直接 fine-tune，而用 RL？ - 知乎]

训练细节

数据

‘instruction-style’的user prompts示例

Illustrative user prompts from InstructGPT distribution

Use Case	Example
brainstorming	What are 4 questions a user might have after reading the instruction manual for a trash compactor? {user manual} 1.
classification	Take the following text and rate, on a scale from 1-10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation {text} Rating:
	{java code} What language is the code above written in?
extract	Extract all course titles from the table below: \| Title \| Lecturer \| Room \| \| Calculus 101 \| Smith \| Hall B \| \| Art History \| Paz \| Hall A \|
	Given the following list of movie titles, write down any names of cities in the titles. {movie titles}
generation	Write a creative ad for the following product to run on Facebook aimed at parents: Product: {product description}
	Write a short story where a brown bear to the beach, makes friends with a seal, and then return home.
	Here’s a message to me: — {email} — Here are some bullet points for a reply: — {message} — Write a detailed reply
rewrite	Translate this sentence to Spanish: <English sentence>
	Rewrite the following text to be more light-hearted: — {very formal text}—
hatZ	The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. Human: Hello, who are you? AI: I am an AI created by OpenAI. How can I help you today? Human: I’d like to cancel my subscription. AI:
closed qa	Help me answer questions about the following short story: {story} What is the moral of the story?
	Answer the following question: What shape is the earth? A) A circle B) A sphere C) An ellipse D) A plane
open qa	I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown". Q: What is human life expectancy in the United States? A: Human life expectancy in the United States is 78 years. Q: Who was president of the United States in 1955? A:
	How do you take the derivative of the sin function?
summarization	Summarize this for a second-grade student: {text}
	{news article} Tl;dr:
other	start with where
	Look up "cowboy" on Google and give me the results.

less ‘instruction-style’式的GPT-3 user prompts示例

Illustrative user prompts from GPT-3 distribution: These are generally less ‘instruction-style’, and contain more explicit prompting. Note that there are some prompts where the user intent is unclear. 给人的感觉就是更more-shot一些，更in context一些，提示更多更明确。另外instructgpt有点命令的意思，如closed qa时，instruct要先说一下“请回答问题”，然后再提问？

Use Case	Example
brainstorming	Tell me a list of topics related to: - interior design - sustainable ecosystems - fake plants
classification	This is a tweet sentiment classifier. {tweet} Sentiment: negative === {tweet} Sentiment: neutral === {tweet} Sentiment:
extract	Text: {text} Keywords:
generation	This is the research for an essay: === {description of research} === Write a high school essay on these topics: ===
closed qa	When you drop a heavy stone from a tree, what happens? A. The stone falls to the ground. B: The stone stays in the tree. C: The stone floats. D: Nothing happens. Answer:

评价

1.3B 参数 InstructGPT 模型的输出优于 175B GPT-3 的输出，尽管参数少了 100 多倍。

InstructGPT/ChatGPT的效果是非常棒的，尤其是引入了人工标注之后，让模型的“价值观”和的正确程度和人类行为模式的“真实性”上都大幅的提升。

优点

InstructGPT/ChatGPT的效果比GPT-3更加真实：这个很好理解，因为GPT-3本身就具有非常强的泛化能力和生成能力，再加上InstructGPT/ChatGPT引入了不同的labeler进行提示编写和生成结果排序，而且还是在GPT-3之上进行的微调，这使得我们在训练奖励模型时对更加真实的数据会有更高的奖励。作者也在TruthfulQA数据集上对比了它们和GPT-3的效果，实验结果表明甚至13亿小尺寸的PPO-ptx的效果也要比GPT-3要好。
InstructGPT/ChatGPT在模型的无害性上比GPT-3效果要有些许提升：原理同上。但是作者发现InstructGPT在歧视、偏见等数据集上并没有明显的提升。这是因为GPT-3本身就是一个效果非常好的模型，它生成带有有害、歧视、偏见等情况的有问题样本的概率本身就会很低。仅仅通过40个labeler采集和标注的数据很可能无法对模型在这些方面进行充分的优化，所以会带来模型效果的提升很少或者无法察觉。
InstructGPT/ChatGPT具有很强的Coding能力：首先GPT-3就具有很强的Coding能力，基于GPT-3制作的API也积累了大量的Coding代码。而且也有部分OpenAI的内部员工参与了数据采集工作。通过Coding相关的大量数据以及人工标注，训练出来的InstructGPT/ChatGPT具有非常强的Coding能力也就不意外了。

缺点

InstructGPT/ChatGPT会降低模型在通用NLP任务上的效果：我们在PPO的训练的时候讨论了这点，虽然修改损失函数可以缓和，但这个问题并没有得到彻底解决。
        有时候InstructGPT/ChatGPT会给出一些荒谬的输出：虽然InstructGPT/ChatGPT使用了人类反馈，但限于人力资源有限。影响模型效果最大的还是有监督的语言模型任务，人类只是起到了纠正作用。所以很有可能受限于纠正数据的有限，或是有监督任务的误导（只考虑模型的输出，没考虑人类想要什么），导致它生成内容的不真实。就像一个学生，虽然有老师对他指导，但也不能确定学生可以学会所有知识点。
        模型对指示非常敏感：这个也可以归结为labeler标注的数据量不够，因为指示是模型产生输出的唯一线索，如果指示的数量和种类训练的不充分的话，就可能会让模型存在这个问题。
        模型对简单概念的过分解读：这可能是因为labeler在进行生成内容的比较时，倾向于给给长的输出内容更高的奖励。
        对有害的指示可能会输出有害的答复：例如InstructGPT/ChatGPT也会对用户提出的“AI毁灭人类计划书”给出行动方案（图5）。这个是因为InstructGPT/ChatGPT假设labeler编写的指示是合理且价值观正确的，并没有对用户给出的指示做更详细的判断，从而会导致模型会对任意输入都给出答复。虽然后面的奖励模型可能会给这类输出较低的奖励值，但模型在生成文本时，不仅要考虑模型的价值观，也要考虑生成内容和指示的匹配度，有时候生成一些价值观有问题的输出也是可能的。

from: -柚子皮-

ref: [InstructGPT: Training language models to follow instructions with human feedback]

[ChatGPT/InstructGPT详解 - 知乎]