KnowLM: A Large Language Model for Knowledge Extraction

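Contents

KnowLM project overview
Motivation for KnowLM
- Problems with ChatGPT
ZhiXi: a LLaMA-based large model for knowledge extraction
- Dataset construction and training process
  - Pre-training dataset construction
  - Pre-training process
  - Instruction-tuning dataset construction
  - Instruction-tuning process
- Open-source datasets and models
- Limitations
- Prompts for information extraction
Deployment
- Environment setup
- Model download
- Using the pre-trained model
- Using the LoRA model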

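KnowLM project overview

KnowLM is an open-source project developed by PhD students in the NLP & KG team at Zhejiang University. It is a knowledge-extraction large model that combines LLMs with knowledge graphs, and it mainly targets named entity recognition (NER), event extraction (EE), and relation extraction (RE).
GitHub: https://github.com/zjunlp/KnowLM/blob/main/README_ZH.md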
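The main contributions of the KnowLM project:

Centered on knowledge and large models, it performs full pre-training of large models such as LLaMA on a purpose-built Chinese-English bilingual corpus.
Using a technique that converts knowledge graphs into instructions, it optimizes knowledge extraction tasks, including NER, RE, and EE, so that information extraction can be driven by human instructions.
It fine-tunes with LoRA on a purpose-built Chinese and English instruction dataset (about 1,400K samples) to improve the model's understanding of human instructions.
It releases the pre-trained model weights and the instruction-tuned LoRA weights.
It releases the full pre-training script (covering conversion, construction, and loading of large corpora) and the LoRA instruction-tuning script (with multi-node, multi-GPU support).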

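Motivation for KnowLM

Problems with ChatGPT

Large models such as ChatGPT have achieved remarkable results in natural language processing, but they still face challenges in learning and understanding knowledge, for example:

LLMs suffer from knowledge fallacies: their knowledge is frozen at training time and hard to update, and the models carry latent errors and biases.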


LLMs perform poorly on certain specific tasks, such as knowledge extraction and reasoning.


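As the example above shows, ChatGPT understands the instruction and produces a reasonable output format, but the quality of the extraction itself is poor.

The paper "LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities" evaluates GPT-4, ChatGPT, and task-specific fine-tuned models on knowledge graph construction and reasoning tasks, including relation extraction, event detection, link prediction, and question answering. The evaluation results are shown in the following figure: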

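ZhiXi: a LLaMA-based large model for knowledge extraction

Dataset construction and training process

ZhiXi is trained in two stages:

Stage 1: full pre-training. This stage strengthens the model's Chinese ability and knowledge reserve.
Stage 2: instruction tuning with LoRA. This stage teaches the model to understand human instructions and produce appropriate output.

The datasets used and the training process are shown in the figures below: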

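Pre-training dataset construction

To improve the model's understanding of Chinese while preserving its original code and English abilities, the team did not expand the vocabulary. Instead, they collected Chinese, English, and code corpora.

Chinese data: drawn from Baidu Baike, WuDao, and Chinese Wikipedia.

English data: mostly sampled from LLaMA's original English corpus. The exception is Wikipedia: the English Wikipedia data in the original paper ends in August 2022, so the team additionally crawled the six months from September 2022 through February 2023.

Code data: crawled from GitHub and LeetCode. Part of it is used for pre-training and the rest for instruction tuning.

The team applied heuristic methods to the crawled data to remove harmful content, and also removed duplicate samples.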
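
The article does not say which heuristics or which deduplication method the team used. As a rough illustration only, the sketch below combines a keyword blocklist (a stand-in for the unspecified heuristics) with exact deduplication by hashing normalized text. `BLOCKLIST`, `normalize`, and `clean_corpus` are hypothetical names introduced here, not part of KnowLM.

```python
import hashlib
import re

# Hypothetical blocklist; the source does not list the actual heuristics.
BLOCKLIST = {"badword"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs):
    """Drop documents that hit the blocklist, then drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if any(word in norm for word in BLOCKLIST):
            continue  # heuristic filter: contains a blocked term
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized text as a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing only catches exact duplicates after normalization. Near-duplicate detection would need a different technique and is outside this sketch.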

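Pre-training process

Document splitting: documents are split with a greedy algorithm. Each sample must consist of complete sentences, the number of segments should be as small as possible, and, subject to those constraints, each sample is made as long as possible. The maximum length of a single sample is 1024.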
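
The greedy splitting described above can be sketched roughly as follows. This is not the project's script: it assumes length is counted in characters (the article does not say whether 1024 means characters or tokens), it uses a simple punctuation-based sentence splitter, and the names `split_sentences` and `greedy_pack` are made up for illustration.

```python
import re

# A sentence is any run of text ending in Chinese or English terminal
# punctuation, or the trailing remainder that has no terminator.
_SENTENCE = re.compile(r"[^。！？.!?]*[。！？.!?]+|[^。！？.!?]+")


def split_sentences(text):
    """Split text into sentences, keeping the punctuation with each sentence."""
    return [s for s in _SENTENCE.findall(text) if s.strip()]


def greedy_pack(text, max_len=1024):
    """Pack whole sentences into samples, filling each sample as far as possible."""
    samples, current = [], ""
    for sent in split_sentences(text):
        # Start a new sample only when the next sentence would not fit.
        if current and len(current) + len(sent) > max_len:
            samples.append(current)
            current = ""
        current += sent
    if current:
        samples.append(current)
    return samples
```

Because a new sample starts only when the next sentence no longer fits, every sample is as long as it can be and the number of samples is minimal for in-order packing. In this sketch, a single sentence longer than `max_len` ends up as its own oversized sample. A real pipeline would have to truncate or split it.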

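Because the data sources are diverse, the team also built a complete data preprocessing toolkit that processes each source and then merges the results.

Because the dataset is very large, loading it all into memory would put too much pressure on the hardware. Following DeepSpeed-Megatron, the data is therefore processed and loaded with mmap: only the index is read into memory, and samples are looked up on disk by index when needed.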
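
To make the idea concrete, here is a minimal sketch of index-in-memory, data-on-disk loading using Python's standard `mmap` module. It is not the DeepSpeed-Megatron format or the KnowLM loader. The file layout (samples stored back to back as UTF-8) and the names `build_store` and `MMapDataset` are assumptions made for this example.

```python
import mmap
import os
import tempfile


def build_store(samples, path):
    """Write samples back to back; return the in-memory index of (offset, length)."""
    index, offset = [], 0
    with open(path, "wb") as f:
        for s in samples:
            data = s.encode("utf-8")
            f.write(data)
            index.append((offset, len(data)))
            offset += len(data)
    return index


class MMapDataset:
    """Keeps only the index in memory; sample bytes are read from the mapped file."""

    def __init__(self, path, index):
        self.index = index
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        offset, length = self.index[i]
        return self._mm[offset:offset + length].decode("utf-8")

    def close(self):
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "corpus.bin")
    index = build_store(["第一条样本", "second sample"], path)
    dataset = MMapDataset(path, index)
    print(dataset[1])
    dataset.close()
```

The operating system pages in only the parts of the file that are actually read, so memory use follows the index size and the working set instead of the full corpus.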

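Finally, pre-training was run on 5,500K Chinese samples, 1,500K English samples, and 900K code samples. It used the transformers Trainer with DeepSpeed ZeRO-3 (in practice, ZeRO-2 was slower in the multi-node, multi-GPU setting) on 3 nodes, each with eight 32 GB V100 GPUs.

An explanation of DeepSpeed ZeRO-3: https://blog.csdn.net/v_JULY_v/article/details/132462452?spm=1001.2014.3001.5501

The training parameters are as follows: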

Parameter	Value
micro batch size (batch size per GPU)	20
gradient accumulation (steps)	3
global batch size (samples per step across all GPUs)	20 * 3 * 24 = 1440
Time per step	260 s
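
The arithmetic behind the table can be checked with a few lines of Python. The DeepSpeed config fragment is illustrative only: the batch fields and the ZeRO stage come from the text above, while `fp16` is an assumption, and the "one pass" estimate assumes every listed sample is seen exactly once, which the article does not claim.

```python
import math

MICRO_BATCH = 20    # per-GPU batch size (from the table)
GRAD_ACCUM = 3      # gradient accumulation steps (from the table)
NUM_GPUS = 3 * 8    # 3 nodes x 8 V100 GPUs
STEP_SECONDS = 260  # time per step (from the table)

global_batch = MICRO_BATCH * GRAD_ACCUM * NUM_GPUS

# Illustrative DeepSpeed config fragment. DeepSpeed expects train_batch_size to
# equal micro batch x accumulation steps x number of GPUs. fp16 is an assumption.
ds_config = {
    "train_micro_batch_size_per_gpu": MICRO_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM,
    "train_batch_size": global_batch,
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}

samples_per_second = global_batch / STEP_SECONDS

# Assumption: one pass over all Chinese, English, and code samples listed above.
total_samples = 5_500_000 + 1_500_000 + 900_000
steps_per_pass = math.ceil(total_samples / global_batch)
days_per_pass = steps_per_pass * STEP_SECONDS / 86_400

print(global_batch, round(samples_per_second, 2), steps_per_pass, round(days_per_pass, 1))
```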

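Instruction-tuning dataset construction

The model needs general abilities (such as reasoning and coding) as well as information extraction abilities (NER, RE, and EE), so the following datasets were used.

General-ability datasets (such as reasoning and coding)
The Chinese data was obtained mainly by using GPT-4 to translate English datasets, such as the Alpaca dataset, the CoT dataset, and the code dataset.
Two strategies were used:
For English datasets such as the CoT and code datasets, the questions and answers were translated directly into Chinese with GPT-4.
For general datasets such as Alpaca, the English questions were given to the model, which was asked to answer in Chinese.

Information extraction (IE) datasets
English data: for open-source English IE datasets such as CoNLL, ACE, and CASIS, the corresponding English instruction datasets were constructed directly.
Chinese data: in addition to open-source datasets such as DuEE, People Daily, and DuIE, the team used its own KG2Instruction dataset to construct the corresponding Chinese instruction datasets.

KG2Instruction (InstructIE) is a Chinese information extraction dataset obtained by distant supervision over Chinese Wikipedia and Wikidata. It covers a wide range of domains to meet real-world extraction needs.

In addition, the team manually built a Chinese general dataset and used the second strategy to translate it into English. The resulting dataset distribution is: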

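Dataset type	Number of samples
CoT (Chinese and English)	202,333
General (Chinese and English)	105,216
Code (Chinese and English)	44,688
English extraction instructions	537,429
Chinese extraction instructions	486,768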
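
As a quick sanity check, the counts in the table can be summed to see how they relate to the "about 1,400K samples" figure quoted earlier, and how much of the mix is extraction data. The dictionary keys are just labels chosen for this snippet.

```python
counts = {
    "CoT (Chinese and English)": 202_333,
    "General (Chinese and English)": 105_216,
    "Code (Chinese and English)": 44_688,
    "English extraction instructions": 537_429,
    "Chinese extraction instructions": 486_768,
}

total = sum(counts.values())
extraction = (counts["English extraction instructions"]
              + counts["Chinese extraction instructions"])

for name, n in counts.items():
    print(f"{name}: {n:,} ({n / total:.1%})")
print(f"total: {total:,}; extraction share: {extraction / total:.1%}")
```

The total comes to 1,376,434 samples, of which roughly three quarters are extraction instructions.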

Figure: construction pipeline for KG2Instruction and the other instruction-tuning datasets.

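Instruction-tuning process

Most fine-tuning scripts today are based on alpaca-lora, so the details are not repeated here. The detailed instruction-tuning parameters and training scripts are in ./finetune/lora.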

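Open-source datasets and models

Instruction type	Count	Download	Used by ZhiXi	Notes
KnowLM-CR (reasoning instructions, Chinese and English)	202,333	Google Drive, HuggingFace	Yes	None
KnowLM-IE (extraction instructions, Chinese)	281,860	Google Drive, HuggingFace	Yes	Contains noise because it was built with distant supervision
KnowLM-Tool (tool-learning instructions, English)	38,241	Google Drive, HuggingFace	No	To be used in the next version

Data notes:

The other information extraction data comes from CoNLL, ACE, CASIS, DuEE, People Daily, DuIE, and similar sources.
The KnowLM-Tool dataset comes from the paper "Making Language Models Better Tool Learners with Execution Feedback"; see that paper's GitHub repository.
The KnowLM-IE dataset comes from the paper "InstructIE: A Chinese Instruction-based Information Extraction Dataset"; see that paper's GitHub repository.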

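Category	Base	Name	Version	Download	Notes
Base model	LLaMA 1	KnowLM-13B-Base	V1.0	HuggingFace	Base model
Dialogue model	LLaMA 1	KnowLM-13B-ZhiXi	V1.0	HuggingFace	Extraction model
Dialogue model	LLaMA 1	KnowLM-13B-IE	V1.0	HuggingFace	Extraction model

knowlm-13b-zhixi and knowlm-13b-ie were both trained with LoRA on top of knowlm-13b-base. The released parameters for each are the result of merging the trained LoRA weights into the knowlm-13b-base weights.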
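
What "merging LoRA weights into the base weights" means can be shown on a toy example. The sketch uses the standard LoRA formulation, W' = W + (alpha / r) * B A, with tiny hand-written matrices in pure Python. The shapes and values are invented for illustration and have nothing to do with the actual 13B checkpoints or the project's merge script.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy shapes: W is (out=2, in=3); rank r=1, so B is (2, 1) and A is (1, 3).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[0.5], [1.0]]
A = [[1.0, 2.0, 0.0]]
alpha, r = 2.0, 1

x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Adapter kept separate: y = W x + (alpha / r) * B (A x)
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged into the weights: W' = W + (alpha / r) * B A, then y = W' x
W_merged = add(W, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give the same output, which is why a merged checkpoint can be loaded like an ordinary model without a separate adapter file.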

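Limitations

Instruction tuning used LoRA instead of full-parameter fine-tuning.
The model does not yet support multi-turn dialogue.
Although the team aims for useful, reasonable, and harmless output, toxic output is still unavoidable in some scenarios.
Pre-training is insufficient: a large pre-training corpus was prepared, but training did not run to completion because of limited compute.

Prompts for information extraction

For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), some ready-made prompts are provided for convenience, such as the relation extraction templates below. You can also try your own prompts.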

relation_template =  {
    0:'已知候选的关系列表：{s_schema}，请你根据关系列表，从以下输入中抽取出可能存在的头实体与尾实体，并给出对应的关系三元组。请按照{s_format}的格式回答。',
    1:'我将给你个输入，请根据关系列表：{s_schema}，从输入中抽取出可能包含的关系三元组，并以{s_format}的形式回答。',
    2:'我希望你根据关系列表从给定的输入中抽取可能的关系三元组，并以{s_format}的格式回答，关系列表={s_schema}。',
    3:'给定的关系列表是：{s_schema}\n根据关系列表抽取关系三元组，在这个句子中可能包含哪些关系三元组？请以{s_format}的格式回答。',
}

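# relation_convert_target0/2/3 are referenced here but defined elsewhere in the
# original code (not shown in this article); key 1 is absent in this excerpt.
relation_int_out_format = {
    0:['"(头实体,关系,尾实体)"', relation_convert_target0],
    2:['"关系：头实体,尾实体\n"', relation_convert_target2],
    3:["JSON字符串[{'head':'', 'relation':'', 'tail':''}, ]", relation_convert_target3],
}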


en_relation_template = {
    0: 'Identify the head entities (subjects) and tail entities (objects) in the following text and provide the corresponding relation triples from relation list {s_schema}. Please provide your answer as a list of relation triples in the form of {s_format}.',
    1: 'Identify the subjects and objects in the text that are related, and provide the corresponding relation triples from relation {s_schema} in the format of {s_format}.',
    2: 'From the given text, extract the possible head entities (subjects) and tail entities (objects) and give the corresponding relation triples. The relations are {s_schema}. Please format your answer as a list of relation triples in the form of {s_format}.',
    3: 'Your task is to identify the head entities (subjects) and tail entities (objects) in the following text and extract the corresponding relation triples, the possible relation list is {s_schema}. Your answer should include relation triples, with each triple formatted as {s_format}.',
    4: 'Given the text, extract the possible head entities (subjects) and tail entities (objects) and provide the corresponding relation triples, the possible relation list is {s_schema}. Format your answer as a list of relation triples in the form of {s_format}.',
    5: 'Your goal is to identify the head entities (subjects) and tail entities (objects) in the text and give the corresponding relation triples. The given relation list is {s_schema}. Please answer with a list of relation triples in the form of {s_format}.',
    6: 'Please extract the possible head entities (subjects) and tail entities (objects) from the text and provide the corresponding relation triples from candidate relation list {s_schema}. Your answer should be in the form of a list of relation triples: {s_format}.',
    7: 'Your task is to extract the possible head entities (subjects) and tail entities (objects) in the given text and give the corresponding relation triples. The relations are {s_schema}. Please answer using the format of a list of relation triples: {s_format}.',
    8: 'Given the {s_schema}, identify the head entities (subjects) and tail entities (objects) and provide the corresponding relation triples. Your answer should consist of relation triples, with each triple formatted as {s_format}',
    9: 'Please find the possible head entities (subjects) and tail entities (objects) in the text based on the relation list {s_schema} and give the corresponding relation triples. Please format your answer as a list of relation triples in the form of {s_format}.',
    10: 'Given relation list {s_schema}, extract the possible subjects and objects from the text and give the corresponding relation triples in the format of {s_format}.',
    11: 'Extract the entities involved in the relationship described in the text and provide the corresponding triples in the format of {s_format}, the possible relation list is {s_schema}.',
    12: 'Given relation list {s_schema}, provide relation triples for the entities and their relationship in the text, using the format of {s_format}.',
    13: 'Extract the entities and their corresponding relationships from the given relationships are {s_schema} and provide the relation triples in the format of {s_format}.',
}

en_relation_int_out_format = {
    0: "{'head':'', 'relation':'', 'tail':''}",
    1: "(Subject, Relation, Object)",
    2: "[Subject, Relation, Object]",
    3: "{head, relation, tail}",
    4: "<head, relation, tail>",
}
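
The templates above contain two placeholders, `{s_schema}` and `{s_format}`. A minimal sketch of how one might fill them is shown below. It copies one English template and one output format from the dictionaries above. The relation list and the helper name `build_instruction` are hypothetical, and how the repository combines the instruction with the input text is not covered here.

```python
# One template and one output format copied from the dictionaries above.
template = (
    "Given relation list {s_schema}, extract the possible subjects and objects "
    "from the text and give the corresponding relation triples in the format "
    "of {s_format}."
)
out_format = "(Subject, Relation, Object)"


def build_instruction(template, schema, out_format):
    """Substitute the candidate relations and the answer format into a template."""
    return template.format(s_schema=schema, s_format=out_format)


schema = ["founder", "located in"]  # hypothetical relation list
instruction = build_instruction(template, schema, out_format)
print(instruction)
```

Output formats that themselves contain braces, such as `{'head':'', 'relation':'', 'tail':''}`, are safe to pass as `s_format`: `str.format` only interprets braces in the template, not in the substituted values.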

Deployment

Environment setup

# Clone the KnowLM repository
git clone https://github.com/zjunlp/KnowLM.git
# Enter the KnowLM directory
cd KnowLM
# Activate conda (base environment)
source activate
# Create a new conda environment
conda create -n knowlm python=3.9 -y
# Switch to the knowlm environment
conda activate knowlm
# Install the GPU build of torch
pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
# Install the remaining dependencies
pip install -r requirements.txt

Model download

Downloads from HuggingFace can be unreliable, so you can use the mirror https://hf-mirror.com instead. The commands are:

pip install -U huggingface_hub

export HF_ENDPOINT=https://hf-mirror.com

# Use $HOME here: a '~' inside single quotes is not expanded by the shell
export HF_HOME="$HOME/autodl-tmp/.cache/huggingface/hub"
# Download knowlm-13b-base
huggingface-cli download --resume-download --local-dir-use-symlinks False zjunlp/knowlm-13b-base-v1.0 --local-dir knowlm-13b-base-v1.0
# Download knowlm-13b-zhixi
huggingface-cli download --resume-download --local-dir-use-symlinks False zjunlp/knowlm-13b-zhixi --local-dir knowlm-13b-zhixi

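Using the pre-trained model

python examples/generate_finetune_web.py --base_model /root/autodl-tmp/knowlm-13b-base-v1.0

Using the LoRA model

python examples/generate_lora_web.py --base_model /root/autodl-tmp/knowlm-13b-zhixi

Example (the prompt and input text are in Chinese; an English translation follows each):

从给定的文本中提取可能的实体和实体类型，可选的实体类型为['地点','人名']，以（实体，实体类型）的格式回答。
(Extract the possible entities and entity types from the given text. The candidate entity types are ['location', 'person name']. Answer in the format (entity, entity type).)

John昨天在纽约的咖啡馆见到了他的好朋友Merry，他们一起喝咖啡聊天，计划着下周去加利福尼亚（California）旅行，他们决定一起租车并预订酒店，他们先计划在下周一去圣弗朗西斯科参观旧金山大桥，下周三去洛杉矶拜访Merry的父亲威廉。
(Yesterday John met his good friend Merry at a café in New York. They had coffee and chatted, planning a trip to California next week. They decided to rent a car together and book a hotel. They plan to go to San Francisco next Monday to visit the San Francisco bridge, and to Los Angeles next Wednesday to visit Merry's father, William.)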
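
The prompt above asks for answers in the form (entity, entity type). If you want to post-process such answers, a small parser along the following lines may be enough. The response string is written by hand for illustration and is not real model output. The regex and the name `parse_pairs` are assumptions of this sketch, which accepts both full-width and ASCII brackets and commas.

```python
import re

# Matches "(entity, type)" with either ASCII or full-width brackets and commas.
PAIR = re.compile(r"[（(]\s*([^，,（）()]+?)\s*[，,]\s*([^，,（）()]+?)\s*[）)]")


def parse_pairs(response, allowed_types):
    """Return (entity, type) pairs whose type is one of the allowed types."""
    pairs = []
    for entity, entity_type in PAIR.findall(response):
        if entity_type in allowed_types:
            pairs.append((entity, entity_type))
    return pairs


# Hypothetical response written for illustration; it is not real model output.
response = "（John，人名）（纽约，地点）(Merry, 人名)（下周一，时间）"
print(parse_pairs(response, {"地点", "人名"}))
```

Filtering by the allowed types drops anything the model returns outside the requested schema, such as the time expression in the made-up response.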