This article is a translated compilation of:
- How DeepSeek R1 was trained
- Mini-R1: Reproducing DeepSeek R1's "aha moment", an RL tutorial
The release of DeepSeek R1 shook the entire industry. Why? DeepSeek-R1 is an open model that rivals OpenAI's o1 on complex reasoning tasks, introduced with Group Relative Policy Optimization (GRPO) and an RL-centric multi-stage training approach. The team not only released the model, they also published a research paper describing how they built it.
In the paper they describe an "aha moment" observed while training the model with pure RL. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learns to allocate more thinking time to a problem by re-evaluating its initial approach, without any human feedback or data describing how to do so. They describe this as the "aha moment":
This behavior is not only a testament to the model's growing reasoning abilities, but also a fascinating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
In this blog post we want to reproduce a small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown game. We will train an open model with reinforcement learning, trying to teach it self-verification and search abilities on its own in order to solve the Countdown game. The Countdown game is a numbers puzzle in which players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach, or get as close as possible to, a target number.
Target Number: 952
Available Numbers: 25, 50, 75, 100, 3, 6
(100 × (3 × 3)) + (50 + 6 / 3) = 952
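To make the game concrete, here is a minimal brute-force solver sketch. It is not part of the original tutorial; the function name solve_countdown and the assumption that every number must be used exactly once are mine, chosen to match the reward function we define later in the post.

```python
from itertools import permutations

def solve_countdown(numbers, target):
    """Brute-force Countdown solver: repeatedly combine two numbers with +, -, *, /
    until a single value equal to the target remains. Assumes every number must be
    used exactly once, matching the reward function defined later in this post."""
    if len(numbers) == 1:
        value, expr = numbers[0]
        return expr if abs(value - target) < 1e-6 else None
    for (i, (a, ea)), (j, (b, eb)) in permutations(list(enumerate(numbers)), 2):
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        candidates = [(a + b, f"({ea} + {eb})"), (a - b, f"({ea} - {eb})"), (a * b, f"({ea} * {eb})")]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        for value, expr in candidates:
            solution = solve_countdown(rest + [(value, expr)], target)
            if solution is not None:
                return solution
    return None

# Example: the puzzle used in the reward-function tests later in this post
print(solve_countdown([(n, str(n)) for n in [19, 36, 55, 7]], 65))
# prints one valid expression equal to 65, e.g. something equivalent to 55 + 36 - 7 - 19
```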
The article includes interactive code that you can run in a Jupyter Notebook to learn how to train a model with GRPO and Q-LoRA. This is a good way to learn TRL and GRPO, but it is very slow and requires a lot of compute. In addition, I added a script and instructions for running the training on a node with multiple GPUs or on a SLURM cluster.
But before we start, let's take a look at Group Relative Policy Optimization (GRPO) and understand how it works.
Group Relative Policy Optimization (GRPO)
DeepSeek AI released DeepSeek-R1, an open model that rivals OpenAI's o1 on complex reasoning tasks, introduced with Group Relative Policy Optimization (GRPO) and an RL-centric multi-stage training approach.
Understanding Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for improving the reasoning capabilities of LLMs. It was introduced in the DeepSeekMath paper in the context of mathematical reasoning. GRPO modifies traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates the baseline from group scores, reducing memory usage and computational overhead. GRPO is now also used by the Qwen team and can be combined with rule-based/binary rewards as well as general reward models to improve models on helpfulness. The algorithm proceeds in the following steps:
- Sampling: generate multiple outputs for each prompt using the current policy.
- Reward scoring: each generation is scored with a reward function, which can be rule-based or outcome-based.
- Advantage calculation: the average reward of the generated outputs is used as a baseline. The advantage of each solution is then computed relative to this baseline, and rewards are normalized within the group (see the sketch after this list).
- Policy optimization: the policy tries to maximize the GRPO objective, which includes the computed advantages and a KL divergence term. This differs from PPO, which applies the KL term inside the reward.
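As a minimal sketch of the advantage calculation, here is what group-relative normalization could look like, assuming one scalar reward per completion in a flat tensor with num_generations completions per prompt. The function name group_relative_advantages and the eps value are illustrative; the exact implementation inside TRL may differ in details.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, num_generations: int, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward against the other
    completions sampled for the same prompt.

    rewards: flat tensor of shape (num_prompts * num_generations,), one scalar reward per completion
    returns: tensor of the same shape with (reward - group mean) / (group std + eps)
    """
    grouped = rewards.view(-1, num_generations)   # (num_prompts, num_generations)
    mean = grouped.mean(dim=1, keepdim=True)      # per-prompt baseline, no value model needed
    std = grouped.std(dim=1, keepdim=True)
    advantages = (grouped - mean) / (std + eps)
    return advantages.view(-1)

# Example: 2 prompts with 4 sampled completions each, scored by a rule-based reward
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0,    # prompt 1
                        0.0, 0.0, 0.0, 1.0])   # prompt 2
print(group_relative_advantages(rewards, num_generations=4))
```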
The main differences from Proximal Policy Optimization (PPO) are:
- No value function: unlike PPO, GRPO does not rely on a separate value function model, which simplifies training and reduces memory consumption.
- Group-based advantage: GRPO uses the average reward of a group of outputs as the baseline. This is more in line with how reward models are trained, which typically evaluate multiple outputs for a single input.
- KL divergence: GRPO incorporates the KL divergence term directly into the loss function, whereas PPO often uses it as part of the reward signal (the full objective is sketched below).
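For reference, the GRPO objective roughly as formulated in the DeepSeekMath paper (notation lightly simplified), where $G$ completions $o_i$ are sampled per question $q$ from the old policy:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\big) - \beta\,\mathbb{D}_{KL}\big[\pi_\theta \,\|\, \pi_{ref}\big]\Big)\Bigg]
$$

where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}\mid q,\, o_{i,<t})}$ and, for outcome-based rewards, $\hat{A}_{i,t} = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}$.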
Pure reinforcement learning (R1-Zero)
While building DeepSeek R1, the team gained deep insights from reinforcement learning experiments on their base model. Starting from DeepSeek V3, they applied GRPO to unsupervised reasoning text completions with rule-based reward models focused on aspects such as format, math and coding:
- Accuracy rewards: evaluate whether the response is correct, e.g. a correct result or a LeetCode problem that compiles.
- Format rewards: evaluate the formatting, forcing the model to put its thinking process between <think> and </think> tags.
This raised the pass@1 score on AIME 2024 from 15.6% to 71.0%, reaching a performance level comparable to OpenAI-o1-0912, while the output token length per problem increased, indicating that the model naturally learned to solve tasks with more thinking time / token generation.
The downside was poor readability and language mixing, which was addressed in R1 with a multi-stage approach of alternating SFT → RL steps.
Multi-stage training of DeepSeek R1
To avoid the unstable cold-start phase of early reinforcement learning (RL) training from the base model, the team started with supervised fine-tuning.
Stage 1/4: Base to Supervised Fine-Tuning (SFT)
Chain-of-thought (CoT) data up to 10k tokens long was collected using a fine-tuned model, R1-Zero, and human annotators. This data was used to fine-tune the DeepSeek V3 base model to improve readability and coherence.
Stage 2/4: Reasoning RL
The same RL pipeline as for R1-Zero was used, with the same rule-based reward models, focused on reasoning-intensive tasks such as coding and math. This time, an additional reward for "language consistency" was used to help the model stick to a single language (a toy sketch of such a reward follows below).
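The paper describes this reward as the proportion of target-language words in the CoT. Here is a toy sketch of what such a reward could look like; the whitespace tokenization and the English-word heuristic below are my own simplifying assumptions, not DeepSeek's implementation.

```python
import re

def language_consistency_reward(completions, **kwargs):
    """Toy reward: fraction of whitespace-separated tokens in the CoT that look like English words.

    The R1 paper describes the reward as the proportion of target-language words in the CoT;
    the tokenization and the English-word heuristic here are simplifying assumptions.
    """
    rewards = []
    for completion in completions:
        tokens = completion.split()
        if not tokens:
            rewards.append(0.0)
            continue
        # crude heuristic: a "target-language word" is a token made of Latin letters and basic punctuation
        english_like = sum(bool(re.fullmatch(r"[A-Za-z][A-Za-z'\-.,!?]*", t)) for t in tokens)
        rewards.append(english_like / len(tokens))
    return rewards
```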
Stage 3/4: Rejection Sampling and SFT
Rejection sampling (RS) was used to generate a large synthetic dataset focused on writing, role-play and other general-purpose tasks. The Stage 2 model, with DeepSeek V3 acting as judge, was used to generate 600k reasoning-related samples, while 200k samples for writing, role-play and other general-purpose tasks were taken from parts of DeepSeek-V3's SFT dataset or regenerated with CoT included.
Stage 4/4: RL for Helpfulness
In the final stage, GRPO is used once more, combining rule-based reward models and outcome reward models to improve the model's helpfulness and harmlessness. This yields the DeepSeek R1 model.
Key takeaways:
- DeepSeek did not use Monte Carlo Tree Search (MCTS) or Process Reward Models (PRM).
- Fine-tuning before applying GRPO actually makes the training process faster and more stable.
- Rule-based rewards focused on accuracy and format are more effective than complex reward models.
1. Set up the development environment
Our first step is to install the Hugging Face libraries and PyTorch, vLLM and TRL, Transformers and Datasets. If you haven't heard of TRL yet, don't worry. It is a library on top of Transformers and Datasets that makes it easier to fine-tune, RLHF and align open LLMs.
# Install Pytorch & other libraries, make sure to match your GPU driver version
%pip install "torch==2.5.1" tensorboard "setuptools<71.0.0" --index-url https://download.pytorch.org/whl/cu121

# Install flash-attn
%pip install flash-attn

# Install Hugging Face libraries
%pip install --upgrade \
  "transformers==4.48.1" \
  "datasets==3.1.0" \
  "accelerate==1.3.0" \
  "hf-transfer==0.1.9" \
  "deepspeed==0.15.4" \
  "trl==0.14.0"

# install vLLM
%pip install "vllm==0.7.0"

## IMPORTANT: If you want to run the notebook and the interactive cells you also need to install the following libraries:
# But first read the blog post and then decide as they might conflict with the libraries for distributed training.
# %pip install "peft==0.14.0" "bitsandbytes==0.45.0"
Note: you may need to restart the kernel to use the updated packages.
We will use the Hugging Face Hub as a remote model versioning service. This means we will automatically push the model, logs and information to the Hub during training. You must register on Hugging Face for this. Once you have an account, we use the login util from the huggingface_hub package to log in and store our token (access key) on disk.
from huggingface_hub import login

login(token="", add_to_git_credential=True)  # ADD YOUR TOKEN HERE
2. Generate training samples with a reasoning prefix from the Countdown game
We will use the Jiayi-Pan/Countdown-Tasks-3to4 dataset, which contains samples with 3 to 4 numbers and their solutions.
As the model we will use Qwen/Qwen2.5-3B-Instruct, a 3B-parameter instruction-tuned model. This makes it easier to showcase the "aha moment", since it already follows the prompt format. But you can also use the base version of Qwen or other models. Jiayi-Pan explored that a model needs a certain quality to be able to learn the reasoning process, starting at >1.5B parameters.
from transformers import AutoTokenizer
from datasets import load_dataset

# Load dataset from Hugging Face Hub
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
# select a random subset of 50k samples
dataset = dataset.shuffle(seed=42).select(range(50000))

# Load tokenizer from Hugging Face Hub to format the dataset to our "r1" prompt
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# generate r1 prompt with a prefix for the model to already start with the thinking process
def generate_r1_prompt(numbers, target):
    r1_prefix = [{
        "role": "system",
        "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."
      },
      {
        "role": "user",
        "content": f"Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 = 1 </answer>."
      },
      {
        "role": "assistant",
        "content": "Let me solve this step by step.\n<think>"
      }]
    return {"prompt": tokenizer.apply_chat_template(r1_prefix, tokenize=False, continue_final_message=True), "target": target}

# convert our dataset to the r1 prompt
dataset = dataset.map(lambda x: generate_r1_prompt(x["nums"], x["target"]))

# split the dataset into train and test
train_test_split = dataset.train_test_split(test_size=0.1)

train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
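As a quick sanity check (not in the original notebook), you can print one formatted sample to verify the chat template and the pre-filled assistant prefix ending in <think>:

```python
# Quick sanity check: inspect one formatted sample to verify the chat template
# and the pre-filled assistant prefix "Let me solve this step by step.\n<think>"
print(train_dataset[0]["prompt"])
print("target:", train_dataset[0]["target"])
```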
3. Train the model with GRPO (educational part)
Note: Section 3 shows the basics of how to use TRL and GRPO. If you want to run the interactive cells, you need to install bitsandbytes and peft, as they are required by the Trainer class. This section is mostly for educational purposes.
TRL supports Group Relative Policy Optimization (GRPO) through the dedicated GRPOTrainer for aligning LLMs from preference data, as described in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. The GRPOTrainer is a subclass of the Trainer from the transformers library and supports all the same features, including logging, checkpointing, distributed training and parameter-efficient fine-tuning (PEFT).
The GRPOTrainer supports generic Outcome Reward Models (ORM) and custom reward functions, which can be used to implement rule-based reward models. In the DeepSeek R1 paper they implemented rule-based reward models to verify the correctness of the generated solutions. In our example we follow a similar approach and create two reward functions that:
- Format reward: checks whether the generated output follows the correct format <think> [thinking] </think><answer> [answer] </answer>
- Accuracy reward: extracts the equation from the <answer> tag and evaluates it against the target, checking that each number is used exactly once.
Note: in our example a correct <answer> contains the equation, e.g. <answer> 55 + 36 - 7 - 19 </answer>
import re

def format_reward_func(completions, target, **kwargs):
    """
    Format: <think>...</think><answer>...</answer>
    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Expected answers

    Returns:
        list[float]: Reward scores
    """
    rewards = []

    for completion, gt in zip(completions, target):
        try:
            # add synthetic <think> as its already part of the prompt and prefilled for the assistant to more easily match the regex
            completion = "<think>" + completion
            # Check if the format is correct
            regex = r"^<think>([^<]*(?:<(?!/?think>)[^<]*)*)<\/think>\n<answer>([\s\S]*?)<\/answer>$"

            match = re.search(regex, completion, re.DOTALL)
            # if the format is not correct, reward is 0
            if match is None or len(match.groups()) != 2:
                rewards.append(0.0)
            else:
                rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards

def equation_reward_func(completions, target, nums, **kwargs):
    """
    Evaluates completions based on:
    2. Mathematical correctness of the answer

    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Expected answers
        nums (list[str]): Available numbers

    Returns:
        list[float]: Reward scores
    """
    rewards = []
    for completion, gt, numbers in zip(completions, target, nums):
        try:
            # add synthetic <think> as its already part of the prompt and prefilled for the assistant to more easily match the regex
            completion = "<think>" + completion
            # Check if the format is correct
            match = re.search(r"<answer>(.*?)<\/answer>", completion)
            if match is None:
                rewards.append(0.0)
                continue
            # Extract the "answer" part from the completion
            equation = match.group(1).strip()
            # Extract all numbers from the equation
            used_numbers = [int(n) for n in re.findall(r'\d+', equation)]

            # Check if all numbers are used exactly once
            if sorted(used_numbers) != sorted(numbers):
                rewards.append(0.0)
                continue
            # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace
            allowed_pattern = r'^[\d+\-*/().\s]+$'
            if not re.match(allowed_pattern, equation):
                rewards.append(0.0)
                continue

            # Evaluate the equation with restricted globals and locals
            result = eval(equation, {"__builtins__": None}, {})
            # Check if the equation is correct and matches the ground truth
            if abs(float(result) - float(gt)) < 1e-5:
                rewards.append(1.0)
            else:
                rewards.append(0.0)
        except Exception:
            # If evaluation fails, reward is 0
            rewards.append(0.0)
    return rewards
Let's try our reward functions with an example.
Note: none of the samples start with <think>, since we add it synthetically to the prompt.
correct_sample_1 = """We need to find an equation using the numbers 19, 36, 55, and 7
exactly once, with basic arithmetic operations, that equals 65. One possible
combination is 55 + 36 - 19 + 7... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""

correct_sample_2 = """ ... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""

wrong_format = """User: Using the numbers [19, 36, 55, 7], create an equation that equals 65."""

wrong_format_2 = """To find the equation that equals 79 using the numbers 95, 78, 6, 88, I'll start by adding 88 and 95:
95 + 88 = 183
Now, let's subtract 104 from 183 to get 79:
183 - 104 = 79
<think> 183 - 104 = 79 </think><think> 183 - 104 = 79 </think><answer> 183 - 104 = 79 </answer>"""

wrong_result = """ ... </think>
<answer> 55 + 36 - 7 - 18 </answer>"""

test_rewards = format_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=["65", "65", "65", "65", "65"], nums=[[19, 36, 55, 7]] * 5)
assert test_rewards == [1.0, 1.0, 0.0, 0.0, 1.0], "Reward function is not working"
test_rewards = equation_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=["65", "65", "65", "65", "65"], nums=[[19, 36, 55, 7]] * 5)
assert test_rewards == [1.0, 1.0, 0.0, 0.0, 0.0], "Reward function is not working"
This looks good. Now let's define the remaining training parameters, create a trainer and start training.
from trl import GRPOConfig, GRPOTrainer, get_peft_config, ModelConfig

# our model we are going to use as policy
model_config = ModelConfig(
    model_name_or_path="Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
    use_peft=True,
    load_in_4bit=True,
)

# Hyperparameters
training_args = GRPOConfig(
    output_dir="qwen-r1-aha-moment",
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    logging_steps=10,
    max_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=True,
    # GRPO specific parameters
    max_prompt_length=256,
    max_completion_length=1024,  # max length of the generated output for our solution
    num_generations=2,
    beta=0.001,
)
trainer = GRPOTrainer(
    model=model_config.model_name_or_path,
    reward_funcs=[format_reward_func, equation_reward_func],
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=get_peft_config(model_config),
)
We can start training by calling the train method on the trainer instance.
Note: reinforcement learning training is very slow and compute intensive. A single step on 1x L4 with Q-LoRA, batch size 1 and only 2 generations per sample takes >20 minutes.
# Train and push the model to the Hub
trainer.train()
# Save model
trainer.save_model(training_args.output_dir)
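Optionally, as a hypothetical quick check after training (not part of the original notebook), you could load the saved Q-LoRA adapter and generate a completion for one held-out prompt. This assumes the adapter was written to training_args.output_dir and that peft is installed.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the saved Q-LoRA adapter on top of the base model (hypothetical check, adapter path assumed)
model = AutoPeftModelForCausalLM.from_pretrained(
    training_args.output_dir, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Generate a completion for one held-out Countdown prompt and inspect the reasoning
inputs = tokenizer(test_dataset[0]["prompt"], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```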
4. Distributed training example for GRPO with Deepspeed and vLLM
More than 20 minutes per step with only 2 generations per sample is not feasible. We need to scale up our training. Hugging Face TRL added support for distributed training with Deepspeed and for using vLLM to speed up generation. I prepared a run_r1_grpo.py script and a receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml config file to run the training.
This configuration was tested and validated on a node with 4x H100 80GB, where a single step takes around 45-60 seconds, since we can leverage vLLM for generation and DeepSpeed for distributed training. Therefore we need to make sure to set num_processes to the number of GPUs you have minus 1, as the last one will be used by vLLM for generation. If you use more GPUs, you need to change vllm_device in the config file to the last GPU index; for example, with 8 GPUs you would set vllm_device=7 and num_processes to 7.
Command to run the training:
accelerate launch --num_processes 3 --config_file configs/accelerate_configs/deepspeed_zero3.yaml scripts/run_r1_grpo.py --config receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml
With the optimized distributed training, a single step with 8 generations per sample on 4x H100 80GB takes around 45-60 seconds. The full training of 450 steps takes around 6 hours.
5. Results and training observations
The script saves random completions to the completion_samples folder, which you can use to inspect the model's progress. It contains completion_samples.txt and success_completion_samples.txt: completion_samples.txt includes all completions, while success_completion_samples.txt contains only those that correctly solve the equation. Below you find interesting training observations on how the performance changes over time, along with the Tensorboard logs and successful reasoning samples.
The model with checkpoints every 25 steps can be found at philschmid/qwen-2.5-3b-r1-countdown.
Hyperparameters
I started the experiment with the hyperparameters from the DeepSeekMath paper, a learning rate of 1e-6 and a beta (KL coefficient) of 0.04, which led to unstable training runs after about 150 steps. I ran some small ablations and, based on a test from OpenRLHF, lowered the learning rate to 5e-7 and beta to 0.001. I could not test how increasing num_generations from 8 to 64 would affect training; 64 is the generation value used in the DeepSeekMath paper. All other parameters can be found in the grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml config file.
Training observations:
- At ~50 steps the model has learned the correct format <think>...</think>\n<answer>...</answer>.
- At 100 steps the success rate for solving the equation is around 25%. The model starts to "reason" with words, see the examples below.
- At 200 steps performance seems to converge much more slowly and we are at a ~40% success rate. The model starts to learn a new "format", in which it solves the equation almost programmatically by trying different combinations and checking the results, see "Successful reasoning samples between step 200 and 450".
- At 450 steps we have a 50% success rate for solving the equation. Performance still improves slowly, and the model keeps the new format it adopted from step 200 onward.
I have a few possible hypotheses for why the model shifts from "word reasoning" to "programmatic execution":
- Qwen 2.5 3B is not strong enough or too small; DeepSeek mentions that you need a very strong base model.
- The reward functions are not defined well enough and reward-hack the model into solving the equation. We could try to force it to use words, e.g. with a number-to-word frequency condition. (We don't know much about DeepSeek's reward functions.)
- Training only on the Countdown game task may naturally force the model to learn the most effective way to solve the equation, since no other format is required.
- We did not train the model long enough; in the R1 paper they showed visuals of training for more than 8000 steps.
Tensorboard results
Successful reasoning samples up to step 200
<think> First, I need to think about the numbers we have: 14, 70, and 80. To get 85, I think 80 might be a good starting point because it's right in between 85. If I subtract or add 80, it won't lead to 85 easily. Next, 70 seems to be a good intermediate number, especially to use in conjunction with 14. Now, I can think about different combinations using addition, subtraction and multiplication/division. If I take 80 - 70 = 10, which is not needed. So, I need to find a way to use 14. If I subtract 14 from 80, I get 66, which still isn't 85. If I subtract 70 from 80, I get 10 again, this isn't constructive yet. Multiplication could create numbers larger than the target 85. Let's consider division: If I divide 70 by 14, I get 5. Then if I add this result to 80, I get 85. This combination seems promising: 80 - (70 / 14) = 80 - 5 = 75, this still isn't quite right. Let me correct my approach: Let's use division and then addition. If I take 80 / (70 / 14) → 80 / 5 = 16, and then 80 - 16 = 64, still incorrect. Let me simplify: 80 / (70 / 14) → 80 / 5 = 16, and then 80 - 16 gives 64 again, let me correct: 80 - (70 / 14) → 80 - 5 = 75, but I need 85. Thus I will use the combinations in a slightly different order to insert numbers: I get 80 + (70 / 14) = 80 + 5 = 85</think><answer> 80 + (70 / 14) </answer>
<think> First, I need to consider how I can use the numbers 1, 21, and 90 to reach 68. I can try different combinations of operations. Let's try to use subtraction and addition. If I subtract 90 from 21, I get -69, which is too far from 68. Now, if I add 1 to 90, I get 91, which is also not right. Another way is to see what happens if I subtract 21 from 90, which gives me 69. This is closer but still not 68. Let me use multiplication instead of subtraction. 90 * 1 = 90. Now, if I try to reduce this slightly by subtracting 21, I am not getting close to 68.Now, let's try another multiplication. I'll think of 1 * (larger value) + (smaller value). Let's try 1 * 90 + 21. This evaluates to 90 + 21 = 111, which is too much. Instead, I will try: 90 / 1 + 21 - 7. But this gives a division by 7, which seems complex. Let's re-evaluate: I'll use 1 to divide 21: 21 / 1 = 21. Now, subtract 21 from 90, and add 1, we have 90 - 21 = 69, using one...Hey, wait on re-evaluating I will subtract 21 from 90, and include 1 to complete the solution: 90 - (21 + 1) which would be 68.</think><answer> 90 - (21 + 1) </answer>
<think> First, I need to consider the numbers we have: 59, 80, and 66. I need to use these numbers to get 45. Let's try to use subtraction first since it can quickly reduce large numbers.Let's see if we can get close by subtracting or adding these numbers. If I subtract 66 from 80, I get 14. Then I need to make this 14 equal 45 when combined with 59. That gives me 14 + 59 = 73, which is too high.Now let's try adding numbers. If I add 59 and 66, I get 125. Then I would need to subtract 80 from 125 to get 45, but 80 is too large to fit well with this.Let me try a different approach. What if I subtract 66 from 80 first and then add 59? That would be 14 + 59 = 73, which is still not 45.Hmm, I need to think of another combination. What if I start with 59 and then add 80 and subtract 66? That would be (59 + 80) - 66, let me calculate it. That equals 139 - 66, which is 73.That's still not 45, so let me try rearranging it slightly: maybe 80 - 66 + 59 would be closer to 45. Calculating again: 80 - 66 is 14, then adding 59 gives 73.This way doesn't work either. Let's flip this: maybe (59 + 66) - 80. Let me check: that's 125 - 80, which equals 45.</think><answer>(59 + 66) - 80</answer>
<think> First, I need to find a combination of the numbers 61, 4, 94, and 21 that can be manipulated using basic arithmetic operations to equal 50. One way to approach this is to look for numbers that can be used to reach a target number close to 50 and then adjust with the remaining numbers.Let's try combining 94 and 21 first:94 + 21 = 115Now, we need to get from 115 to 50. We can try subtracting 65:115 - 65 = 50Now we need to find a way to get 65 using the remaining numbers 61 and 4. We can try:61 - 4 = 57Now we need to get from 57 to 65. Adding 8 would give us that:57 + 8 = 65So, we need to find a way to get 8 using 61 and 4. We can try:61 - 94 = -36This is negative, so let's try dividing:61 / 94 = 0.648987...This is not helping, so let's try multiplication and division in a different way:(61 - 4) * (94 / 21) = 57 * 4.476... = 254.4This is not working either. Let's try a different approach:(94 - 61) = 33Now we have:33 + 21 = 54We need to adjust from 54 to 50, so we can:54 - 4 = 50So, the equation becomes:(94 - 61) + (21 - 4) = 50</think><answer>(94 - 61) + (21 - 4)</answer>
<think> First, I'll consider the possibility of directly subtracting two numbers to get close to 31. If I subtract 57 from 63, I get 7. Now, if I subtract this 7 from 25, I get 25 - 7 = 18. Far from 31, so this doesn't work.Next, let's try adding and subtracting. Dividing 63 by 25 will give a decimal, so that's unlikely to work. Now, let's see if adding and subtracting in another order could work. For example, let's try 63 - 25, and then we see this gets us 38. Now, we need a way to get to 31 from here. Subtracting 7 from 38 would work, but we can't figure out how to create that 7 from the given numbers.Hmm, let me try to incorporate all numbers by adding and subtracting. If I try 63 + 25, I get 88. This is quite far from 31, but if 88 - 57, I get 31 exactly!This seems to be the equation: 63 + 25 - 57 equals 31.</think><answer> 63 + 25 - 57 </answer>
Successful reasoning samples between step 200 and 450
<think> We need to use the numbers 39, 16, and 55 with basic arithmetic operations to get 78. Let's try different combinations:- 55 + 39 - 16 = 84 - 16 = 68 (too low)- 55 + 16 - 39 = 71 - 39 = 32 (too low)- 55 - 16 + 39 = 39 + 39 = 78 (this works)So, the equation is 55 - 16 + 39 = 78. </think><answer> 55 - 16 + 39 </answer>
<think> We need to use the numbers 21, 5, and 40 with basic arithmetic operations to get 24. Let's try different combinations:- 21 - 5 - 40 = -15- 21 - 5 + 40 = 56- 21 + 5 - 40 = -14- 21 * 5 / 40 = 2.1 (not an integer, let's try another)- 40 * 5 / 21 = 9.52 (not 24)- 40 / 5 - 21 = 9 - 21 = -12- 5 * 40 / 21 = 100 / 21 ~= 4.76 (not 24)- 5 + 40 - 21 = 24 ! This works.</think><answer> 5 + 40 - 21 </answer>
<think> We need to use the numbers 75, 99, 62, and 34 with basic arithmetic operations to get 72. Let's try different combinations:- 99 - 75 + 62 - 34 = 92 - 34 = 58 (too high)- 99 - 75 - 62 + 34 = 24 - 62 + 34 = -38 + 34 = -4 (too low)- 99 + 75 - 62 - 34 = 174 - 96 = 78 (too high)- 99 + 75 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)- 99 - 75 + 62 + 34 = 24 + 96 = 120 (too high)- 75 + 99 - 62 - 34 = 174 - 96 = 78 (too high)- 75 + 99 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)- 75 + 62 - 99 + 34 = 137 - 99 + 34 = 38 + 34 = 72So, 75 + 62 - 99 + 34 equals 72.</think><answer> 75 + 62 - 99 + 34 </answer>
Conclusion
The release of DeepSeek R1 and its research paper could be a turning point for open science and open-source development. Just a week after the DeepSeek release, we were able to reproduce a simple version of R1 learning to "reason" using GRPO and the Countdown game. While our implementation focuses on a specific task rather than general reasoning and converges to a very specific "reasoning" format, it shows that the method works.
In our mini-R1 experiment we used GRPO with two rule-based rewards, and it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B-parameter model. This gives a sense of the compute requirements for scaling up reinforcement learning. DeepSeek ran more than 8000 steps with a 671B model, and they likely did many ablations.
Looking ahead to 2025, it is clear we are on the cusp of much bigger progress. RL will become more accessible and user-friendly, and more researchers and developers will explore its potential, but compared to supervised fine-tuning it will also require far more compute than before.
Other reproductions
- TinyZero: a reproduction of DeepSeek R1 Zero on the Countdown and multiplication tasks
- A successful reproduction of DeepSeek R1 Zero: three-stage RL, steady growth in response length, emergent language mixing
- A Chinese tutorial for reproducing DeepSeek R1 Zero
- Training your own small local model (Qwen2.5-0.5B) via DeepSeek-R1 distillation