《书生大模型实战营第3期》基础岛第1关：书生大模型全链路开源体系

文章大纲

简介
- 更新
- 性能
- - 基座模型
  - 对话模型
- 依赖
- 使用案例
- - 通过 Transformers 加载
  - 通过 ModelScope 加载
  - 通过前端网页对话
- InternLM 高性能部署
- - 推理
  - 1百万字超长上下文推理
- 智能体
- 微调&训练
- 评测
- - 标准客观评测
  - 长文评估（大海捞针）
  - 数据污染评估
  - 智能体评估
  - 主观评估
视频课程学习笔记
任务
其他学习内容
参考文献
- 本人学习系列笔记
- - 第二期
  - 第三期
- 课程资源
- 论文
- 其他参考

简介

官网：

https://internlm.intern-ai.org.cn/

github：

https://github.com/InternLM/InternLM

InternLM2.5 系列模型在本仓库正式发布，具有如下特性：

卓越的推理性能：在数学推理方面取得了同量级模型最优精度，超越了 Llama3 和 Gemma2-9B。
有效支持百万字超长上下文：模型在 1 百万字长输入中几乎完美地实现长文“大海捞针”，而且在 LongBench 等长文任务中的表现也达到开源模型中的领先水平。可以通过 LMDeploy 尝试百万字超长上下文推理。更多内容和文档对话 demo 请查看这里。
工具调用能力整体升级：InternLM2.5 支持从上百个网页搜集有效信息进行分析推理，相关实现将于近期开源到 Lagent。InternLM2.5 具有更强和更具有泛化性的指令理解、工具筛选与结果反思等能力，新版模型可以更可靠地支持复杂智能体的搭建，支持对工具进行有效的多轮调用，完成较复杂的任务。可以查看更多样例。

更新

[2024.07.19] 我们发布了 1.8B、7B 和 20B 大小的 InternLM2-Reward 系列奖励模型。可以在下方的模型库进行下载，或者在 model cards 中了解更多细节。

[2024.06.30] 我们发布了 InternLM2.5-7B、InternLM2.5-7B-Chat 和 InternLM2.5-7B-Chat-1M。可以在下方的模型库进行下载，或者在 model cards 中了解更多细节。

[2024.03.26] 我们发布了 InternLM2 的技术报告。可以点击 arXiv链接来了解更多细节。

[2024.01.31] 我们发布了 InternLM2-1.8B，以及相关的对话模型。该模型在保持领先性能的情况下，提供了更低廉的部署方案。

[2024.01.23] 我们发布了 InternLM2-Math-7B 和 InternLM2-Math-20B 以及相关的对话模型。InternLM-Math以较小的尺寸超过了ChatGPT的表现。可以点击InternLM-Math进行下载，并了解详情。

[2024.01.17] 我们发布了 InternLM2-7B 和 InternLM2-20B 以及相关的对话模型，InternLM2 在数理、代码、对话、创作等各方面能力都获得了长足进步，综合性能达到开源模型的领先水平。可以点击下面的模型库进行下载或者查看模型文档来了解更多细节.

[2023.12.13] 我们更新了 InternLM-7B-Chat 和 InternLM-20B-Chat 模型权重。通过改进微调数据和训练策略，新版对话模型生成的回复质量更高、语言风格更加多元。

[2023.09.20] InternLM-20B 已发布，包括基础版和对话版。

模型说明：

目前 InternLM 2.5 系列只发布了 7B 大小的模型，我们接下来将开源 1.8B 和 20B 的版本。7B 为轻量级的研究和应用提供了一个轻便但性能不俗的模型，20B 模型的综合性能更为强劲，可以有效支持更加复杂的实用场景。每个规格不同模型关系如下所示：

InternLM2.5：经历了大规模预训练的基座模型，是我们推荐的在大部分应用中考虑选用的优秀基座。
InternLM2.5-Chat: 对话模型，在 InternLM2.5 基座上经历了有监督微调和 online RLHF。InternLM2.5-Chat 面向对话交互进行了优化，具有较好的指令遵循、共情聊天和调用工具等的能力，是我们推荐直接用于下游应用的模型。
InternLM2.5-Chat-1M: InternLM2.5-Chat-1M 支持一百万字超长上下文，并具有和 InternLM2.5-Chat 相当的综合性能表现。

局限性： 尽管在训练过程中我们非常注重模型的安全性，尽力促使模型输出符合伦理和法律要求的文本，但受限于模型大小以及概率生成范式，模型可能会产生各种不符合预期的输出，例如回复内容包含偏见、歧视等有害内容，请勿传播这些内容。由于传播不良信息导致的任何后果，本项目不承担责任。

补充说明： 上表中的 HF 表示对应模型为 HuggingFace 平台提供的 transformers 框架格式；Origin 则表示对应模型为我们 InternLM 团队的 InternEvo 框架格式。

性能

我们使用开源评测工具 OpenCompass 在几个重要的基准测试中对 InternLM2.5 进行了评测。部分评测结果如下表所示。欢迎访问 OpenCompass 排行榜获取更多评测结果。

基座模型

Benchmark	InternLM2.5-7B	Llama3-8B	Yi-1.5-9B
MMLU (5-shot)	71.6	66.4	71.6
CMMLU (5-shot)	79.1	51.0	74.1
BBH (3-shot)	70.1	59.7	71.1
MATH (4-shot)	34.0	16.4	31.9
GSM8K (4-shot)	74.8	54.3	74.5
GPQA (0-shot)	31.3	31.3	27.8

对话模型

Benchmark	InternLM2.5-7B-Chat	Llama3-8B-Instruct	Gemma2-9B-IT	Yi-1.5-9B-Chat	GLM-4-9B-Chat	Qwen2-7B-Instruct
MMLU (5-shot)	72.8	68.4	70.9	71.0	71.4	70.8
CMMLU (5-shot)	78.0	53.3	60.3	74.5	74.5	80.9
BBH (3-shot CoT)	71.6	54.4	68.2*	69.6	69.6	65.0
MATH (0-shot CoT)	60.1	27.9	46.9	51.1	51.1	48.6
GSM8K (0-shot CoT)	86.0	72.9	88.9	80.1	85.3	82.9
GPQA (0-shot)	38.4	26.1	33.8	37.9	36.9	38.4

我们使用 ppl 对基座模型进行 MCQ 指标的评测。
评测结果来自 OpenCompass ，评测配置可以在 OpenCompass 提供的配置文件中找到。
由于 OpenCompass 的版本迭代，评测数据可能存在数值差异，因此请参考 OpenCompass 的最新评测结果。
* 表示从原论文中复制而来。

依赖

Python >= 3.8
PyTorch >= 1.12.0 (推荐 2.0.0 和更高版本)
Transformers >= 4.38

使用案例

InternLM 支持众多知名的上下游项目，如 LLaMA-Factory、vLLM、llama.cpp 等。这种支持使得广大用户群体能够更高效、更方便地使用 InternLM 全系列模型。为方便使用，我们为部分生态系统项目提供了教程，访问此处即可获取。

接下来我们展示使用 Transformers，ModelScope 和 Web demo 进行推理。
对话模型采用了 chatml 格式来支持通用对话和智能体应用。
为了保障更好的使用效果，在用 Transformers 或 ModelScope 进行推理前，请确保安装的 transformers 库版本满足以下要求：

transformers >= 4.38

通过 Transformers 加载

通过以下的代码从 Transformers 加载 InternLM2.5-7B-Chat 模型（可修改模型名称替换不同的模型）

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat", trust_remote_code=True)
# 设置`torch_dtype=torch.float16`来将模型精度指定为torch.float16，否则可能会因为您的硬件原因造成显存不足的问题。
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat", device_map="auto",trust_remote_code=True, torch_dtype=torch.float16)
# (可选) 如果在低资源设备上，可以通过bitsandbytes加载4-bit或8-bit量化的模型，进一步节省GPU显存.
  # 4-bit 量化的 InternLM 7B 大约会消耗 8GB 显存.
  # pip install -U bitsandbytes
  # 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
  # 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
# 模型输出：你好！有什么我可以帮助你的吗？
response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
print(response)

通过 ModelScope 加载

通过以下的代码从 ModelScope 加载 InternLM2.5-7B-Chat 模型（可修改模型名称替换不同的模型）

import torch
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2_5-7b-chat')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
# (可选) 如果在低资源设备上，可以通过bitsandbytes加载4-bit或8-bit量化的模型，进一步节省GPU显存.
  # 4-bit 量化的 InternLM 7B 大约会消耗 8GB 显存.
  # pip install -U bitsandbytes
  # 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
  # 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)

通过前端网页对话

可以通过以下代码启动一个前端的界面来与 InternLM Chat 7B 模型进行交互

pip install streamlit
pip install transformers>=4.38
streamlit run ./chat/web_demo.py

InternLM 高性能部署

我们使用 LMDeploy 完成 InternLM 的一键部署。

推理

通过 pip install lmdeploy 安装 LMDeploy 之后，只需 4 行代码，就可以实现离线批处理：

from lmdeploy import pipeline
pipe = pipeline("internlm/internlm2_5-7b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

为了减少内存占用，我们提供了4位量化模型 internlm2_5-7b-chat-4bit。可以按照如下方式推理该模型：

from lmdeploy import pipeline
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

此外，可以同步开启 8bit 或者 4bit KV 在线量化功能：

from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
                backend_config=TurbomindEngineConfig(quant_policy=8))
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

更多使用案例可参考部署指南，详细的部署教程则可在这里找到。

1百万字超长上下文推理

激活 LMDeploy 的 Dynamic NTK 能力，可以轻松把 internlm2_5-7b-chat 外推到 200K 上下文。

注意: 1M 上下文需要 4xA100-80G。

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
        rope_scaling_factor=2.5,
        session_len=1048576,  # 1M context length
        max_batch_size=1,
        cache_max_entry_count=0.7,
        tp=4)  # 4xA100-80G.
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
response = pipe(prompt)
print(response)