介绍GPT-o1:一系列解决困难问题（ science, coding, and math ）的推理模型

news2026/2/16 4:38:42

openai o1介绍

一、官方技术报告要点剖析
- 实验1 benchmark分析
- 实验2:和phd比赛
- 技术细节：Chain of Thought的使用
- 人类偏好评估Human preference evaluation
- satety
- 技术细节：隐藏思维链为监控模型提供了机会:)
- openai的几点conclusion
二、官方介绍剖析 Introducing OpenAI o1-preview
- o1的安全性分析
- o1时如何工作的
- o1可以为谁服务？
- OpenAI o1-mini 什么时候选用？
- 下一步的升级计划

一、官方技术报告要点剖析

https://openai.com/index/learning-to-reason-with-llms/
技术报告核心内容解读
报告日期：September 12, 2024

实验1 benchmark分析

On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
在这里插入图片描述

结果分析：

1.	GPT-4o 是基础版本的 GPT-4 模型。在这次考试中，它只能解答 12% 的问题，也就是平均每场考试 15 道题中仅能解答 1.8 道题。
2.	o1 版本 是经过进一步优化的 GPT-4 模型：
	•	单次采样（即每道题只运行一次模型）时，能解答 74% 的问题，平均解答 11.1 道题。
	•	如果对每道题进行 64 次采样（即多次运行模型并选择共识答案），它的正确率提升至 83%，平均解答 12.5 道题。
	•	如果对每道题进行 1000 次采样，并且使用学习得出的评分函数进行重新排序，它的正确率进一步提升至 93%，平均解答 13.9 道题。
3.	取得 13.9 分的表现使得这个模型达到了全美前 500 名学生的水平，并且超过了参加美国数学奥林匹克竞赛（USA Mathematical Olympiad, USAMO）的资格线。

实验2:和phd比赛

We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions.

这里有一个问题，就是，所招募的专家测试结果是找的各个专业的phd做完整的测试，然后取精确率的平均值作为对比数值，还是请他们分别做自己所属专业的部分试题，然后将结果汇总作为专家结果。

技术细节：Chain of Thought的使用

Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem.
Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.
1.通过强化学习，o1学会磨练其思维链并完善其使用的策略
It learns to recognize and correct its mistakes.
2.o1学会了识别和纠正错误。
It learns to break down tricky steps into simpler ones.
3.o1学会了将棘手的步骤分解成简单的步骤。
It learns to try a different approach when the current one isn’t working.
4.当前方法无效时，它学会尝试不同的方法。 在实例当中，会发现

这一过程极大地提高了模型的推理能力。为了说明这一飞跃，报告中还展示了 o1-preview 在几个难题上的思维链。（详情见报告）

人类偏好评估Human preference evaluation

In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred.
在这项评估当中，人类培训师被展示了o1-preview和GPT-4o对于一个提示词的匿名回复，然后投票选出他们喜欢的回复。在数据分析、编码和数学等推理繁重的类别中，o1预览比gpt-4o更受欢迎。 然而，o1预览在某些自然语言任务中并不受欢迎，这表明它并不适合所有用例。
在这里插入图片描述

satety

We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
我们认为，使用思维链可以在安全性和一致性方面取得重大进展，因为（1）它使我们能够以清晰的方式观察模型思维，（2）关于安全规则的模型推理对分布外的场景更稳健。

技术细节：隐藏思维链为监控模型提供了机会:)

the hidden chain of thought allows us to “read the mind” of the model and understand its thought process.
例如，在未来，我们可能希望监控思维链，寻找操纵用户的迹象。
然而，为了实现这一点，模型必须能够以不变的形式自由表达其思想，因此我们无法将任何政策合规性或用户偏好训练到思想链上。我们也不想让用户直接看到不一致的思维链。
因此，在权衡了用户体验、竞争优势和追求思维链监控的选择等多个因素后， we have decided not to show the raw chains of thought to users.

我们承认这一决定有缺点。我们努力通过教模型从答案中的思维链中再现任何有用的想法来部分弥补这一点。

注意，为了让模型保持市场优势，对于思维连的具体过程，openai选择了隐藏，并通过让模型从*真实思维链*和*答案response*中再现思维链中有用的想法的方式来弥补隐藏思维链带来的问题 因此，对于思维链的具体技术细节，无从得知。

openai的几点conclusion

1.o1显著推进了AI reasoning的最新工作
2.我们相信o1会解锁AI在科学、编程、数学等相关领域的新应用案例。
3.openai对于开发者会如何使用o1保持激动和期待。

二、官方介绍剖析 Introducing OpenAI o1-preview

o1的安全性分析

系统卡片：https://openai.com/index/openai-o1-system-card/
我们使用公共和内部评估来衡量不允许的内容、人口统计公平性、幻觉倾向和危险能力等风险。
基于这些评估，我们在模型和系统级别实施了保护措施，如块列表和安全分类器，以有效降低o1的上述这些风险。
部署是安全的，因为它不会实现现有资源之外的任何事情，网络安全和模型自治的风险水平为“低”，化学、生物、放射性和劝导的风险等级为“中”
完整的system card系统介绍pdf

A new series of reasoning models for solving hard problems. Available now.
20240912:https://openai.com/index/introducing-openai-o1-preview/
Update on September 17, 2024: Rate limits are now 50 queries per week for o1-preview and 50 queries per day for o1-mini.

o1时如何工作的

Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes. （定义思考过层、尝试不同的策略、识别其中的错误）
实验细节见技术报告部分。

o1可以为谁服务？

These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields.
o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.
1.医疗保健研究人员用来注释细胞测序数据，
2.被物理学家用来生成量子光学所需的复杂数学公式，
3。被所有领域的开发人员用来构建和执行多步骤工作流程。