[Solved] How to convert text annotated in Doccano into the CoNLL 2003 format that NER models can process directly


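I am working on named entity recognition (NER) and chose the Doccano platform for annotating the text.

Doccano exports annotation results in JSONL format; our export was the file NER.jsonl.

However, deep learning NER models built in Python usually expect their input in CoNLL 2003 format, so the JSONL data exported from Doccano has to be converted. In this format, each line holds a token from the original text on the left and its label on the right.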
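To make that layout concrete, here is a minimal sketch that tags whitespace-separated tokens with BIO labels from character spans. The function name `to_conll_lines`, the sample sentence, and the PER/LOC labels are all made up for illustration, and the output is simplified to two columns (the original CoNLL 2003 data also carries part-of-speech and chunk columns between the token and the entity tag).

```python
from typing import List, Tuple


def to_conll_lines(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag whitespace-separated tokens with BIO labels.

    `spans` holds (start, end, label) character offsets, end exclusive.
    This is a hypothetical helper for illustration, not part of any library.
    """
    lines = []
    pos = 0
    for token in text.split():
        # Locate the token in the original text to get its character offsets.
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"  # default: outside any entity
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of a span gets B-, later tokens get I-.
                tag = ("B-" if start == s else "I-") + label
                break
        lines.append(f"{token} {tag}")
    return lines


example = to_conll_lines(
    "Barack Obama visited Paris",
    [(0, 12, "PER"), (21, 26, "LOC")],
)
print("\n".join(example))
```

Each printed line is one token followed by its tag, which is the token-left, label-right shape described above.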

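At first I was working out how to write the conversion code myself, but then I found that Doccano has an official conversion tool, doccano-transformer. It is a Python library and is easy to use. Here is the usage code given officially:

First install it from the command prompt: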

pip install doccano-transformer

Then use it from Python:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')
dataset.to_conll2003(tokenizer=str.split)

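However, the official code is incomplete: it does not write the result to a txt file that you can work with directly. Below is the code I actually used, which adds a step that saves the converted result to a txt file: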

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

# Read the Doccano export and convert it to CoNLL 2003.
dataset = read_jsonl(filepath='NER.jsonl', dataset=NERDataset, encoding='utf-8')
gen = dataset.to_conll2003(tokenizer=str.split)

file_name = "CoNLL.txt"

# Write the "data" field of each converted item to the output file.
with open(file_name, "w", encoding="utf-8") as file:
    for item in gen:
        file.write(item["data"] + "\n")

But running it raised an error: KeyError: 'The file should includes either "labels" or "annotations".'
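The message suggests the library looks for a "labels" or "annotations" key in each record. A quick way to see which top-level keys the export actually contains is sketched below; `count_top_level_keys` is a hypothetical helper name, and the code uses only the standard library.

```python
import json
from collections import Counter


def count_top_level_keys(jsonl_path: str) -> Counter:
    """Count how often each top-level key appears across a JSONL file.

    Hypothetical diagnostic helper; it only reads the file.
    """
    keys = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                keys.update(json.loads(line).keys())
    return keys
```

Calling `count_top_level_keys("NER.jsonl")` and printing the result shows at a glance whether either of the expected keys is present.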

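After a long search online I found a fix, which takes two steps:

① In the exported jsonl file, rename the "entities" key to "annotations".

② In the script "doccano_transformer\examples.py", change line 29 so that it reads "labels[0].append([". (I opened examples.py in Notepad++ to make the edit.)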
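Step ① can be done by hand in a text editor, but for a large export a short script is less error-prone. The following is a minimal sketch using only the standard library; the function name `rename_entities_key` and the output file name used in the note below are assumptions for illustration. It changes only the key name and leaves the contents of each record untouched.

```python
import json


def rename_entities_key(src_path: str, dst_path: str) -> int:
    """Copy a JSONL file, renaming each record's "entities" key to "annotations".

    Returns the number of records that were changed.
    Hypothetical helper; it does not alter the values stored under the key.
    """
    changed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():  # skip blank lines
                continue
            record = json.loads(line)
            if "entities" in record:
                record["annotations"] = record.pop("entities")
                changed += 1
            # ensure_ascii=False keeps non-ASCII text readable in the output.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return changed
```

For example, `rename_entities_key("NER.jsonl", "NER_fixed.jsonl")` writes a new file and leaves the original export intact; the conversion code would then need to read the new file name.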

Then run the same conversion code as before, and it works:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

# Read the Doccano export and convert it to CoNLL 2003.
dataset = read_jsonl(filepath='NER.jsonl', dataset=NERDataset, encoding='utf-8')
gen = dataset.to_conll2003(tokenizer=str.split)

file_name = "CoNLL.txt"

# Write the "data" field of each converted item to the output file.
with open(file_name, "w", encoding="utf-8") as file:
    for item in gen:
        file.write(item["data"] + "\n")
