Preface

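To fine-tune a large model, you first need a high-quality SFT dataset. It is hard to overstate how much that quality matters, and many studies have already confirmed its importance. Interested readers can jump to an earlier article of mine:

《大模型时代下数据的重要性》 (The Importance of Data in the Era of Large Models): https://zhuanlan.zhihu.com/p/639207933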

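Today we briefly summarize the current approaches to "automatically preparing SFT data" and give the corresponding references; interested readers can read the papers for the details. Note that this post only covers machine-generated data. If you can crawl real human-written data, for example from vertical-domain websites, so much the better.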

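SFT data is nothing more than <prompt, response> pairs, so we need both high-quality prompts and high-quality responses.

Let's look at these two sides separately and see how current papers tackle prompts and responses.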

Preparing prompts

SELF-INSTRUCT

Representative work:

《SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions》: https://arxiv.org/pdf/2212.10560.pdf

BELLE : https://github.com/LianjiaTech/BELLE

The core idea is to start from a batch of SFT seed data and then have the model generate new prompts in a few-shot fashion. The prompt it uses is as follows:

You are asked to come up with a set of 30 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.
10. Make sure the output is gramatically correct with punctuation if needed.
List of 30 tasks:
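
To make the few-shot step concrete, here is a minimal sketch of how the final prompt could be assembled from seed data. The function name `build_self_instruct_prompt`, the seed field names (`instruction`, `input`, `output`) and the numbered layout are my own assumptions for illustration; the actual repositories may format things differently.

```python
import random


def build_self_instruct_prompt(header, seed_tasks, num_shots=3, rng=None):
    """Append a few numbered seed tasks to the header and leave the next slot open.

    `header` is the requirements text quoted above; `seed_tasks` is a list of
    dicts with hypothetical keys "instruction", "input" and "output".
    """
    rng = rng or random.Random()
    shots = rng.sample(seed_tasks, k=min(num_shots, len(seed_tasks)))
    lines = [header.rstrip(), ""]
    for i, task in enumerate(shots, start=1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        # Tasks without an input use the "<noinput>" marker from requirement 8.
        lines.append(f"{i}. Input: {task.get('input') or '<noinput>'}")
        lines.append(f"{i}. Output: {task['output']}")
    # The model is expected to continue from the next task number.
    lines.append(f"{len(shots) + 1}. Instruction:")
    return "\n".join(lines)
```

Sampling different seed tasks on every call is what keeps the generated prompts diverse; the model's continuation is then parsed back into new <instruction, input, output> records.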

Wizard

This series is built on the idea of evolutionary learning. Representative work:

WizardLM: https://arxiv.org/abs/2304.12244

WizardCoder: https://arxiv.org/abs/2306.08568

WizardMath: https://github.com/nlpxucan/WizardLM/tree/main/WizardMath

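The two figures above illustrate the idea vividly. The core is again to start from a batch of seed prompts and have the model produce new prompts automatically; the difference from self-instruct is that it uses more fine-grained prompt engineering for generation. Specifically, it evolves prompts in two directions: depth and breadth.

In-depth evolution: this aims to increase difficulty without changing the semantics of the original prompt, using five operations: adding constraints, deepening, concretizing, increasing reasoning steps, and complicating the input. One such engineered prompt is: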

I want you act as a Prompt Rewriter.
Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.
But the rewritten prompt must be reasonable and must be understood and responded by humans.
Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please do not omit the input in #Given Prompt#.
You SHOULD complicate the given prompt using the following method:
Please add one more constraints/requirements into #Given Prompt#
You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.
‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’ are not allowed to appear in #Rewritten Prompt#
#Given Prompt#:
<Here is instruction.>
#Rewritten Prompt#:
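
As a rough sketch, the template above can be filled in programmatically. The names `DEEPEN_TEMPLATE`, `ADD_CONSTRAINTS` and `build_deepen_prompt` are mine, not from the paper; only the "add constraints" method sentence is quoted from the prompt above, and the sentences for the other four operations would have to be supplied by the caller.

```python
# The in-depth evolution prompt quoted above, with two placeholders:
# {method} for the operation sentence and {instruction} for the seed prompt.
DEEPEN_TEMPLATE = (
    "I want you act as a Prompt Rewriter.\n"
    "Your objective is to rewrite a given prompt into a more complex version to make "
    "those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.\n"
    "But the rewritten prompt must be reasonable and must be understood and responded by humans.\n"
    "Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. "
    "Also, please do not omit the input in #Given Prompt#.\n"
    "You SHOULD complicate the given prompt using the following method:\n"
    "{method}\n"
    "You should try your best not to make the #Rewritten Prompt# become verbose, "
    "#Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.\n"
    "‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’ "
    "are not allowed to appear in #Rewritten Prompt#\n"
    "#Given Prompt#:\n"
    "{instruction}\n"
    "#Rewritten Prompt#:\n"
)

# The only method sentence given in the text above.
ADD_CONSTRAINTS = "Please add one more constraints/requirements into #Given Prompt#"


def build_deepen_prompt(instruction, method=ADD_CONSTRAINTS):
    """Fill the template with one seed instruction and one evolution method."""
    return DEEPEN_TEMPLATE.format(method=method, instruction=instruction)
```

For example, `build_deepen_prompt("Sort a list of integers.")` gives the text to send to the model, and the model's reply becomes the evolved prompt.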

In-breadth evolution: generate a brand-new instruction based on a given instruction. One such engineered prompt is:

I want you act as a Prompt Creator.
Your goal is to draw inspiration from the #Given Prompt# to create a brand new prompt.
This new prompt should belong to the same domain as the #Given Prompt# but be even more rare.
The LENGTH and difficulty level of the #Created Prompt# should be similar to that of the #Given Prompt#. The #Created Prompt# must be reasonable and must be understood and responded by humans.
‘#Given Prompt#’, ‘#Created Prompt#’, ‘given prompt’ and ‘created prompt’ are not allowed to appear in #Created Prompt#.
#Given Prompt#:
<Here is instruction.>
#Created Prompt#:

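Of course, not every evolved prompt is usable, so some have to be filtered out. The following four cases are treated as failed instruction evolution:

1 Compared with the original instruction, the evolved instruction provides no information gain (verified with ChatGPT);

2 The generated instruction is hard for LLMs to answer (for example, when the response contains "sorry" and is relatively short);

3 The response generated by the LLM contains only punctuation and stop words;

4 The evolved instruction obviously copies some words from the evolving prompt, such as "given prompt", "rewritten prompt", "#Rewritten Prompt#", etc.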
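
The four rules above can be sketched as a simple filter. This is only an illustration under my own assumptions: the 80-character threshold for "relatively short", the tiny stop-word list, and the `adds_information` callable (a stand-in for the ChatGPT check in rule 1) are all hypothetical and should be tuned or replaced in practice.

```python
import string

# Tiny illustrative stop-word list (an assumption; use a real list in practice).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
# Phrases that indicate the evolved instruction copied text from the evolving prompt.
LEAKED_PHRASES = ("given prompt", "rewritten prompt", "created prompt")


def is_failed_evolution(original, evolved, response, adds_information=None, short_len=80):
    """Return True if the evolved instruction hits any of the four failure cases."""
    # Rule 1: no information gain. Needs an external judge, passed in as a callable.
    if adds_information is not None and not adds_information(original, evolved):
        return True
    # Rule 2: the response contains "sorry" and is short (threshold is an assumption).
    if "sorry" in response.lower() and len(response) < short_len:
        return True
    # Rule 3: the response consists only of punctuation and stop words.
    tokens = [t.strip(string.punctuation).lower() for t in response.split()]
    if all(t == "" or t in STOP_WORDS for t in tokens):
        return True
    # Rule 4: the evolved instruction leaks words from the evolving prompt.
    lowered = evolved.lower()
    return any(phrase in lowered for phrase in LEAKED_PHRASES)
```

Rules 2 to 4 are cheap string checks, so it makes sense to run them first on large batches and reserve the model-based check in rule 1 for the survivors.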

Backtranslation

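The idea here takes a different path and uses back-translation. Representative work:

《Self-Alignment with Instruction Backtranslation》: https://arxiv.org/pdf/2308.06259.pdf

This work assumes you have a large pool of unlabeled samples, for example a very large text corpus (the unlabeled set), plus SFT seed data. The seed data is used as <response, prompt> pairs to train a model that the paper calls the backward model. It is the exact reverse of an SFT model: the backward model generates a prompt from a response. Once it is trained, you can feed it the large-scale unlabeled samples to generate prompts.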
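
The data flow can be sketched in a few lines. This only shows how the pairs are reversed and how unlabeled text is turned into new pairs; the training itself is omitted, and `backward_model` is a hypothetical callable standing in for the trained model. The function names and the `input`/`target` keys are my own.

```python
def to_backward_examples(seed_pairs):
    """Swap each (prompt, response) seed pair so the response becomes the model input."""
    return [{"input": response, "target": prompt} for prompt, response in seed_pairs]


def label_unlabeled(texts, backward_model):
    """Treat each unlabeled text as a response and ask the backward model for a prompt.

    Returns (prompt, response) pairs in the usual SFT order.
    """
    return [(backward_model(text), text) for text in texts]
```

The output of `to_backward_examples` is what the backward model would be trained on, and the output of `label_unlabeled` is the candidate SFT data.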

Preparing responses

simple distillation

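The simplest and currently most common approach is to take each prompt and directly fetch the reply from ChatGPT, or even GPT-4, as the response, that is, to distill OpenAI's models directly. Because ChatGPT and GPT-4 are top-tier models with very strong capabilities, this guarantees response quality to a certain extent.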
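
In code this is little more than a loop. The sketch below assumes a hypothetical `teacher` callable that wraps whatever API client you use and returns the reply text; no real client is shown, and skipping empty replies is my own addition.

```python
def distill(prompts, teacher):
    """Collect <prompt, response> pairs by asking the teacher model for each prompt."""
    pairs = []
    for prompt in prompts:
        response = teacher(prompt)
        # Skip empty replies so they do not end up in the SFT set.
        if response and response.strip():
            pairs.append({"prompt": prompt, "response": response.strip()})
    return pairs
```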

cot distillation

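The idea behind this class of methods is that even ChatGPT and GPT-4 cannot answer some difficult prompts accurately, such as math and reasoning prompts, which makes the automatically obtained responses unusable. To address this, chain-of-thought (CoT) techniques can help the model produce more reliable responses. The simplest CoT is the familiar trick of adding something like "please give a detailed answer" or "please reason step by step". Below we introduce several more sophisticated CoT works: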

《Tree of Thoughts: Deliberate Problem Solving with Large Language Models》：https://arxiv.org/abs/2305.10601

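This one also builds chains of thought along two directions, depth and breadth; the paper calls it ToT, an enhanced version of CoT. Sampling multiple chains of thought and voting on their results can further improve the reasoning accuracy of LLMs.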
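
The sampling-and-voting part can be sketched as a plain majority vote over the final answers. To be clear, this is not the full ToT tree search from the paper, only the voting step described above. The "Answer:" marker convention and both function names are my own assumptions.

```python
from collections import Counter


def extract_final_answer(chain):
    """Hypothetical convention: the final answer follows the last 'Answer:' marker."""
    marker = "Answer:"
    idx = chain.rfind(marker)
    return chain[idx + len(marker):].strip() if idx != -1 else None


def majority_vote(chains):
    """Return the most common final answer and the chains that reached it."""
    answers = [a for a in map(extract_final_answer, chains) if a]
    if not answers:
        return None, []
    winner, _ = Counter(answers).most_common(1)[0]
    kept = [c for c in chains if extract_final_answer(c) == winner]
    return winner, kept
```

The chains returned in `kept` are natural candidates for the response side of the SFT pair, since they agree with the majority answer.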

《Answering Questions by Meta-Reasoning over Multiple Chains of Thought》：https://arxiv.org/pdf/2304.13007.pdf

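Put simply, the idea is to improve answer accuracy with an external knowledge base, which the paper calls evidence. The original prompt is first decomposed, evidence is then retrieved for each part, which yields multiple chains of thought, and finally the chains are aggregated.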

《SCALING RELATIONSHIP ON LEARNING MATHEMATICAL REASONING WITH LARGE LANGUAGE MODELS》: https://arxiv.org/pdf/2308.01825.pdf

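This paper focuses on solving math problems. The basic idea is to send the same prompt to different large models to obtain different solution approaches, that is, different chains of thought, while filtering out any chain whose final result is incorrect. This yields a large dataset.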
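
A minimal sketch of this multi-model filtering, assuming each model is a callable that returns a chain of thought and that a reference answer is available for every question. The names `collect_correct_chains`, `models` and `extract_answer` are hypothetical, and the whitespace-based deduplication is my own addition rather than something stated above.

```python
def collect_correct_chains(question, reference_answer, models, extract_answer):
    """Query every model and keep distinct chains whose final answer matches the reference.

    `models` maps a model name to a callable returning a chain of thought;
    `extract_answer` pulls the final answer out of a chain.
    """
    kept, seen = [], set()
    for name, generate in models.items():
        chain = generate(question)
        # Drop chains that end in a wrong final answer.
        if extract_answer(chain) != reference_answer:
            continue
        # Collapse whitespace so trivially identical chains are kept only once.
        normalized = " ".join(chain.split())
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append({"prompt": question, "response": chain, "source": name})
    return kept
```

Checking only the final answer is a cheap filter: it cannot tell whether the intermediate steps are sound, so some manual spot checks are still worthwhile.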

Summary

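As the works above show, they all use LLMs themselves to generate prompt and response data. The overall direction of generation is from simple to complex.

prompt : simple self-instruct -> more complex evolutionary learning

response : simple distillation -> CoT distillation -> multi-model distillation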

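There are many ways to add complexity, such as the evolutionary learning, CoT, and multi-model methods above, and these methods can be borrowed across the prompt and response sides. For example, multi-model distillation can be applied to prompt generation by having several models each run evolutionary learning to generate prompts. Borrowing from evolutionary learning in the other direction, we can rewrite a response to make it longer and more complex, or even give the model a <prompt, response> pair and ask it to write a better response to that prompt than the current one.

Beyond borrowing, these methods can also be stacked and combined freely, for example first generating prompts and then using multi-model CoT to obtain the responses.

In short, feel free to get creative and experiment with building your own data!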