Information Extraction Dataset Processing

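Introduction

The RAMS dataset (RAMS: Roles Across Multiple Sentences) was released by Johns Hopkins University in 2020. It is an event extraction dataset built from news text, with 9,124 annotated events covering 139 event types and 65 argument role types. The event types span several domains, for example: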

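Life events (life)
Conflict events (conflict)
Disaster events (disaster)
Justice events (justice)
Contact events (contact)
Government events (government)

Argument role types include, for example:

Place (place)
Participant (participant)
Destination (destination)
Origin (origin)
Victim (victim)
Defendant (defendant)

The dataset is well suited to event extraction, in particular to identifying and classifying event structures and argument roles. Because an argument may sit in a different sentence from its trigger, it is also useful for argument linking across sentences.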

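I. Features

Diverse event types: coverage of several domains makes the extraction task broader and more complex.
Detailed role annotation: each event's arguments are labeled with their roles, which gives useful context for tasks such as building event graphs or causal reasoning.
Structured annotation: beyond the raw text, every event and its participants carry span-level semantic labels, so the data can support higher-level text analysis.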

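II. Download

The latest and earlier releases of the dataset can be downloaded from the official download site.
A copy is also available from the dataset page on my home page.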

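III. Dataset

3.1 Data

The data is split into three files: train, dev, and test.

Each line of each file holds one JSON string.

Each JSON object contains:

ent_spans: start and end (inclusive) token indices, plus an event/argument/role string.
evt_triggers: start and end (inclusive) token indices, plus an event type string.
sentences: the document text, as a list of tokenized sentences.
gold_evt_links: (event, argument, role) triples that follow the span format above.
source_url: where the text came from.
split: which data split the document belongs to.
doc_key: which individual file the document corresponds to (a prefix is prepended to every key; in the example below it is nw_).

All other fields are extraneous and exist to allow for future iterations of RAMS.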
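As a rough illustration of the layout described above, the sketch below parses one JSON line and reports which of the documented fields are missing. The record is a made-up miniature rather than a real RAMS entry, and the names `DOCUMENTED_FIELDS` and `parse_record` are assumptions introduced for this example.

```python
import json

# Fields described in the list above; real files carry extra fields too.
DOCUMENTED_FIELDS = (
    "ent_spans", "evt_triggers", "sentences",
    "gold_evt_links", "source_url", "split", "doc_key",
)


def parse_record(line):
    """Parse one JSON line and list any documented fields it lacks."""
    record = json.loads(line)
    missing = [name for name in DOCUMENTED_FIELDS if name not in record]
    return record, missing


# Hypothetical miniature record, for illustration only (not real RAMS data).
sample_line = json.dumps({
    "doc_key": "nw_example",
    "split": "train",
    "source_url": "https://example.com/article",
    "sentences": [["The", "victim", "died", "in", "the", "city", "."]],
    "evt_triggers": [[2, 2, [["life.die.deathcausedbyviolentevents", 1.0]]]],
    "ent_spans": [[0, 1, [["evt090arg02victim", 1.0]]]],
    "gold_evt_links": [[[2, 2], [0, 1], "evt090arg02victim"]],
})

record, missing = parse_record(sample_line)
print(sorted(record))
print("missing fields:", missing)
```

Checking for missing fields up front is a cheap guard before any span arithmetic is attempted on a record.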

One record, pretty-printed (line 1 of train.jsonlines), is shown below:

{
    "rel_triggers": [],
    "gold_rel_links": [],
    "doc_key": "nw_RC000462ebb18ca0b29222d5e557fa31072af8337e3a0910dca8b5b62f",
    "ent_spans": [
        [
            42,
            43,
            [
                [
                    "evt090arg02victim",
                    1.0
                ]
            ]
        ],
        [
            85,
            88,
            [
                [
                    "evt090arg01killer",
                    1.0
                ]
            ]
        ],
        [
            26,
            26,
            [
                [
                    "evt090arg04place",
                    1.0
                ]
            ]
        ]
    ],
    "language_id": "eng",
    "source_url": "https://www.washingtonpost.com/news/powerpost/paloma/daily-202/2016/06/17/daily-202-more-republicans-ditch-trump-conclude-he-cannot-win/5763a1e0981b92a22d0f8a36/",
    "evt_triggers": [
        [
            69,
            69,
            [
                [
                    "life.die.deathcausedbyviolentevents",
                    1.0
                ]
            ]
        ]
    ],
    "split": "train",
    "sentences": [
        [
            "Transportation",
            "officials",
            "are",
            "urging",
            "carpool",
            "and",
            "teleworking",
            "as",
            "options",
            "to",
            "combat",
            "an",
            "expected",
            "flood",
            "of",
            "drivers",
            "on",
            "the",
            "road",
            "."
        ],
        [
            "(",
            "Paul",
            "Duggan",
            ")"
        ],
        [
            "--",
            "A",
            "Baltimore",
            "prosecutor",
            "accused",
            "a",
            "police",
            "detective",
            "of",
            "\u201c",
            "sabotaging",
            "\u201d",
            "investigations",
            "related",
            "to",
            "the",
            "death",
            "of",
            "Freddie",
            "Gray",
            ",",
            "accusing",
            "him",
            "of",
            "fabricating",
            "notes",
            "to",
            "suggest",
            "that",
            "the",
            "state",
            "\u2019s",
            "medical",
            "examiner",
            "believed",
            "the",
            "manner",
            "of",
            "death",
            "was",
            "an",
            "accident",
            "rather",
            "than",
            "a",
            "homicide",
            "."
        ],
        [
            "The",
            "heated",
            "exchange",
            "came",
            "in",
            "the",
            "chaotic",
            "sixth",
            "day",
            "of",
            "the",
            "trial",
            "of",
            "Baltimore",
            "Officer",
            "Caesar",
            "Goodson",
            "Jr.",
            ",",
            "who",
            "drove",
            "the",
            "police",
            "van",
            "in",
            "which",
            "Gray",
            "suffered",
            "a",
            "fatal",
            "spine",
            "injury",
            "in",
            "2015",
            "."
        ],
        [
            "(",
            "Derek",
            "Hawkins",
            "and",
            "Lynh",
            "Bui",
            ")"
        ]
    ],
    "gold_evt_links": [
        [
            [
                69,
                69
            ],
            [
                85,
                88
            ],
            "evt090arg01killer"
        ],
        [
            [
                69,
                69
            ],
            [
                42,
                43
            ],
            "evt090arg02victim"
        ],
        [
            [
                69,
                69
            ],
            [
                26,
                26
            ],
            "evt090arg04place"
        ]
    ]
}

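1. `sentences`:

The document is split into five sentences. Token indices run over the whole document starting at 0, so the sentences cover tokens 0-19, 20-23, 24-70, 71-105, and 106-112:
- Sentence 1: "Transportation officials are urging carpool and teleworking as options to combat an expected flood of drivers on the road."
- Sentence 2: "(Paul Duggan)"
- Sentence 3: "-- A Baltimore prosecutor accused a police detective of “sabotaging” investigations related to the death of Freddie Gray, accusing him of fabricating notes to suggest that the state’s medical examiner believed the manner of death was an accident rather than a homicide."
- Sentence 4: "The heated exchange came in the chaotic sixth day of the trial of Baltimore Officer Caesar Goodson Jr., who drove the police van in which Gray suffered a fatal spine injury in 2015."
- Sentence 5: "(Derek Hawkins and Lynh Bui)"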
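Because span indices count tokens across the whole document, it helps to know where each sentence starts. The sketch below rebuilds the token lists of the example record (by splitting on spaces) and maps a document-level index to a sentence number and a position inside that sentence. The helper names `sentence_starts` and `locate` are assumptions made for this example.

```python
from bisect import bisect_right
from itertools import accumulate

# Tokens of the example record, written as space-separated strings.
raw_sentences = [
    "Transportation officials are urging carpool and teleworking as options "
    "to combat an expected flood of drivers on the road .",
    "( Paul Duggan )",
    "-- A Baltimore prosecutor accused a police detective of “ sabotaging ” "
    "investigations related to the death of Freddie Gray , accusing him of "
    "fabricating notes to suggest that the state ’s medical examiner believed "
    "the manner of death was an accident rather than a homicide .",
    "The heated exchange came in the chaotic sixth day of the trial of "
    "Baltimore Officer Caesar Goodson Jr. , who drove the police van in "
    "which Gray suffered a fatal spine injury in 2015 .",
    "( Derek Hawkins and Lynh Bui )",
]
sentences = [s.split() for s in raw_sentences]


def sentence_starts(sentences):
    """Return the document-level index of the first token of each sentence."""
    lengths = [len(s) for s in sentences]
    return [0] + list(accumulate(lengths))[:-1]


def locate(sentences, token_index):
    """Map a document-level token index to (sentence number, local index), 0-based."""
    total = sum(len(s) for s in sentences)
    if not 0 <= token_index < total:
        raise IndexError(f"token index {token_index} outside 0..{total - 1}")
    starts = sentence_starts(sentences)
    sent_no = bisect_right(starts, token_index) - 1
    return sent_no, token_index - starts[sent_no]


print(sentence_starts(sentences))      # [0, 20, 24, 71, 106]
sent_no, local = locate(sentences, 69)
print(sent_no, local, sentences[sent_no][local])  # 2 45 homicide
```

Sentence numbers here are 0-based, so the value 2 corresponds to sentence 3 in the list above.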

2. `evt_triggers` (event triggers):

[69, 69] is the word "homicide" in sentence 3, labeled with the event type "life.die.deathcausedbyviolentevents" (a death caused by a violent event).
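The event type string in the example is dotted, so its levels can be separated with a plain split. The sketch below does that and counts top-level types over a list of records. The function names are assumptions for this example, and the single record is cut down to the one field the code needs.

```python
from collections import Counter


def split_event_type(event_type):
    """Split a dotted event type string into its levels."""
    return tuple(event_type.split("."))


def top_level_counts(records):
    """Count triggers by the first level of their event type."""
    counts = Counter()
    for record in records:
        for _start, _end, labels in record["evt_triggers"]:
            event_type = labels[0][0]  # first (label, score) pair
            counts[split_event_type(event_type)[0]] += 1
    return counts


# Only the evt_triggers field of the example record is reproduced here.
records = [
    {"evt_triggers": [[69, 69, [["life.die.deathcausedbyviolentevents", 1.0]]]]},
]
print(split_event_type("life.die.deathcausedbyviolentevents"))
print(top_level_counts(records))
```

Run over a full split, a count like this gives a quick view of how the events are distributed across domains.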

3. `ent_spans` (entity spans: start and end indices plus the event role):

[42, 43] is "Freddie Gray" in sentence 3, with the role "victim".
[85, 88] is "Officer Caesar Goodson Jr." in sentence 4, with the role "killer".
[26, 26] is "Baltimore" in sentence 3, with the role "place".
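The role strings in the example, such as "evt090arg02victim", appear to pack an event number, an argument number, and a role name into one label. The sketch below separates them with a regular expression. The pattern is inferred from the three labels in the example only and is an assumption, not a documented format; `parse_role_label` is a name made up for this example.

```python
import re

# Assumed layout: "evt" + digits + "arg" + digits + role name.
# Inferred from the example labels, not taken from a specification.
ROLE_LABEL = re.compile(r"^evt(\d+)arg(\d+)(.+)$")


def parse_role_label(label):
    """Split a role label into (event number, argument number, role name)."""
    match = ROLE_LABEL.match(label)
    if match is None:
        raise ValueError(f"unexpected role label: {label!r}")
    event_no, arg_no, role = match.groups()
    return int(event_no), int(arg_no), role


for label in ("evt090arg02victim", "evt090arg01killer", "evt090arg04place"):
    print(label, "->", parse_role_label(label))
```

Raising on a label that does not match keeps a wrong guess about the format from passing silently.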

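4. `gold_evt_links` (event-argument-role triples):

First triple: the trigger is "homicide", the argument is "Officer Caesar Goodson Jr.", and the role is "killer".
Second triple: the trigger is "homicide", the argument is "Freddie Gray", and the role is "victim".
Third triple: the trigger is "homicide", the argument is "Baltimore", and the role is "place".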
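Reading the triples by hand means looking up each inclusive span in the flattened token list. The sketch below automates that lookup. It uses a small made-up record rather than real RAMS data, and `span_text` and `readable_links` are names assumed for this example.

```python
def span_text(tokens, start, end):
    """Join the tokens of an inclusive [start, end] span."""
    return " ".join(tokens[start:end + 1])


def readable_links(record):
    """Turn index-based links into (trigger text, argument text, label) triples."""
    tokens = [tok for sent in record["sentences"] for tok in sent]
    triples = []
    for (t_start, t_end), (a_start, a_end), label in record["gold_evt_links"]:
        triples.append((
            span_text(tokens, t_start, t_end),
            span_text(tokens, a_start, a_end),
            label,
        ))
    return triples


# Hypothetical two-sentence record, for illustration only.
toy_record = {
    "sentences": [
        ["The", "victim", "died", "in", "the", "city", "."],
        ["Police", "blamed", "the", "driver", "."],
    ],
    "gold_evt_links": [
        [[2, 2], [0, 1], "evt090arg02victim"],
        [[2, 2], [5, 5], "evt090arg04place"],
        [[2, 2], [9, 10], "evt090arg01killer"],
    ],
}

for triple in readable_links(toy_record):
    print(triple)
```

The third link points into the second sentence, which mirrors how the "killer" argument in the real example lies one sentence after the trigger.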

5. `source_url`:

The document comes from: https://www.washingtonpost.com/news/powerpost/paloma/daily-202/2016/06/17/daily-202-more-republicans-ditch-trump-conclude-he-cannot-win/5763a1e0981b92a22d0f8a36/

6. `split`:

The sample belongs to the training set (train).

7. `doc_key`:

The document ID is "nw_RC000462ebb18ca0b29222d5e557fa31072af8337e3a0910dca8b5b62f", which uniquely identifies the document.

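IV. Data Processing

The complete script comes first; sections 4.1 to 4.3 then walk through it step by step.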

import json


def load_data(file_path):
    """Read a JSON-lines file: one JSON object per line."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                data.append(json.loads(line))
    return data


def save_to_json(data, file_path):
    """Write the processed samples as a single indented JSON array."""
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)


def extract_event_data(entry):
    # Flatten the sentences into one token list; every span index is a
    # token offset into this document-level list.
    text = [token for sentence in entry["sentences"] for token in sentence]

    # Entities: (start, end, role label), keeping the first label of each span
    ent_spans = [(span[0], span[1], span[2][0][0]) for span in entry["ent_spans"]]

    # Event triggers: (start, end, event type), keeping the first label
    evt_triggers = [(trigger[0], trigger[1], trigger[2][0][0]) for trigger in entry["evt_triggers"]]

    # Event-argument links are passed through unchanged
    evt_links = entry["gold_evt_links"]

    return text, ent_spans, evt_triggers, evt_links


def prepare_training_data(entries):
    dataset = []
    for entry in entries:
        text, ent_spans, evt_triggers, evt_links = extract_event_data(entry)

        # Build one training sample per document
        dataset.append({
            'text': text,
            'entities': ent_spans,
            'triggers': evt_triggers,
            'links': evt_links
        })
    return dataset


if __name__ == '__main__':
    train_data = load_data("./train.jsonlines")

    training_dataset = prepare_training_data(train_data)
    save_to_json(training_dataset, 'train.json')
    print(training_dataset[0])

4.1 Loading and parsing the data

First, load the JSON-lines data file and parse the fields of each record.

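import json

def load_data(file_path):
    """Read a JSON-lines file: one JSON object per line."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                data.append(json.loads(line))
    return data

# The raw split is JSON-lines; train.json is the processed output written later.
train_data = load_data('./train.jsonlines')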

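4.2 Data preprocessing

Convert the document's sentences, event triggers, roles, and entities into a form that an event extraction model can use. Here we extract the tokens, the event triggers, and the role information.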

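def extract_event_data(entry):
    # Flatten the sentences into one token list so that the span indices,
    # which are token offsets, can be used to index it directly.
    text = [token for sentence in entry["sentences"] for token in sentence]

    # Entities: (start, end, role label), keeping the first label of each span
    ent_spans = [(span[0], span[1], span[2][0][0]) for span in entry["ent_spans"]]

    # Event triggers: (start, end, event type), keeping the first label
    evt_triggers = [(trigger[0], trigger[1], trigger[2][0][0]) for trigger in entry["evt_triggers"]]

    # Event-argument links are passed through unchanged
    evt_links = entry["gold_evt_links"]

    return text, ent_spans, evt_triggers, evt_links

# Example: extract and print the first record only
for entry in train_data[:1]:
    text, ent_spans, evt_triggers, evt_links = extract_event_data(entry)
    print(f"Text: {text}")
    print(f"Entities: {ent_spans}")
    print(f"Event triggers: {evt_triggers}")
    print(f"Event-argument links: {evt_links}")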

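4.3 Generating model input

For event extraction, a common input is the text together with its event triggers and roles. We can build a dataset that frames the text as a sequence-labeling task, or one that labels event triggers and arguments through classification.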

def prepare_training_data(entries):
    dataset = []
    for entry in entries:
        text, ent_spans, evt_triggers, evt_links = extract_event_data(entry)

        # Build one training sample per document
        dataset.append({
            'text': text,
            'entities': ent_spans,
            'triggers': evt_triggers,
            'links': evt_links
        })
    return dataset

training_dataset = prepare_training_data(train_data)
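The sequence-labeling formulation mentioned in section 4.3 can be sketched as BIO tagging over the token list. This is one common encoding rather than something the dataset prescribes. The function name `bio_tags` and the toy sentence are assumptions for this example, and overlapping spans are resolved by letting the later span overwrite the earlier one.

```python
def bio_tags(num_tokens, spans):
    """Build BIO tags from inclusive (start, end, label) spans.

    If spans overlap, the later span overwrites the earlier one.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not 0 <= start <= end < num_tokens:
            raise ValueError(f"span ({start}, {end}) outside 0..{num_tokens - 1}")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical sample in the shape produced by prepare_training_data,
# with role names already shortened for readability.
tokens = ["The", "victim", "died", "in", "the", "city", "."]
triggers = [(2, 2, "life.die.deathcausedbyviolentevents")]
arguments = [(0, 1, "victim"), (5, 5, "place")]

trigger_tags = bio_tags(len(tokens), triggers)
argument_tags = bio_tags(len(tokens), arguments)
for token, t_tag, a_tag in zip(tokens, trigger_tags, argument_tags):
    print(f"{token}\t{t_tag}\t{a_tag}")
```

Keeping trigger tags and argument tags in separate sequences avoids collisions when a trigger and an argument cover the same token.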