Example code:
```python
import os
import re
import json
import nltk
from tqdm import tqdm
from transformers import pipeline

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

## Check If Fact or Opinion
# Model: lighteternal/fact-or-opinion-xlmr-el
fact_opinion_classifier = pipeline("text-classification", model="lighteternal/fact-or-opinion-xlmr-el")

def wr_dict(filename, dic):
    # Append one record to a JSON array on disk (read-modify-write).
    if not os.path.isfile(filename):
        data = [dic]
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)

def rm_file(file_path):
    # Remove the file if it exists, so each run starts clean.
    if os.path.exists(file_path):
        os.remove(file_path)

with open('datasource/news_filter_dup.json', 'r') as file:
    data = json.load(file)

save_path = 'datasource/news_filter_fact.json'
print(len(data))
print(data[0].keys())
rm_file(save_path)

for d in tqdm(data):
    fact_list = []
    body = d['body']
    # Split the body into paragraphs on blank lines, then into sentences.
    paragraphs = re.split(r'\n{2,}', body)
    for text in paragraphs:
        sentences = sent_tokenize(text)
        for sentence in sentences:
            try:
                sentence_result = fact_opinion_classifier(sentence)[0]
                # LABEL_1 == Fact
                if sentence_result["label"] == "LABEL_1":
                    fact_list.append(sentence)
            except Exception:
                # Log sentences the classifier fails on and keep going.
                print(sentence)
    d['fact_list'] = fact_list
    wr_dict(save_path, d)
```
💡 What the code does:
Split each news body into sentences, run every sentence through the
`lighteternal/fact-or-opinion-xlmr-el`
model to decide whether it is a Fact, keep the Fact sentences, and write the results to a new JSON file.
🔍 Step-by-step walkthrough
📦 1. Import dependencies & initialize the model
```python
import os, re, json, nltk
from tqdm import tqdm
from transformers import pipeline

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

fact_opinion_classifier = pipeline("text-classification", model="lighteternal/fact-or-opinion-xlmr-el")
```

- `nltk.download('punkt')`: downloads the English sentence tokenizer used by `sent_tokenize`
- `pipeline("text-classification", model=...)`: loads the Fact/Opinion classification model
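For a quick feel of the classifier's output: calling the pipeline on a single string returns a one-element list of `{label, score}` dicts, which is why the main loop indexes `[0]`. A minimal sketch (the printed score is illustrative, not from an actual run):

```python
result = fact_opinion_classifier("The Earth orbits the Sun.")
print(result)
# e.g. [{'label': 'LABEL_1', 'score': 0.99}]  -- score value illustrative
print(result[0]["label"])   # 'LABEL_1' means Fact for this model
```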
🧱 2. Helper functions
```python
def wr_dict(filename, dic):
    ...
```

- Appends each processed news item to the output JSON file, one record at a time

```python
def rm_file(file_path):
    ...
```

- Deletes the output file if it already exists, so stale results from a previous run don't get mixed in
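One caveat: `wr_dict` re-reads and rewrites the entire JSON array on every call, so saving N items costs O(N²) work overall. If that becomes a bottleneck, a common alternative is JSON Lines (one object per line, appended). A minimal sketch; `jsonl_append` is a hypothetical replacement, not part of the original script:

```python
import json

def jsonl_append(filename, dic):
    # Append one JSON object per line; no re-reading needed.
    with open(filename, 'a') as f:
        f.write(json.dumps(dic) + '\n')

# Read everything back later:
# data = [json.loads(line) for line in open(filename)]
```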
📂 3. Load the input data (news bodies)
```python
with open('datasource/news_filter_dup.json', 'r') as file:
    data = json.load(file)
```

- Loads the deduplicated news file
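A quick sanity check on the loaded records can catch schema surprises early (a hypothetical addition, assuming every record carries a `body` field as in the example further below):

```python
# Hypothetical check, not in the original script.
assert all('body' in d for d in data), "every record needs a 'body' field"
print(len(data), "records; first record keys:", sorted(data[0].keys()))
```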
🧹 4. Main processing loop (extract the fact sentences)
```python
for d in tqdm(data):      # iterate over every news item
    fact_list = []
    body = d['body']      # the news body text
```
- First split the body into paragraphs on blank lines (`\n{2,}`)
- Then split each paragraph into sentences with `sent_tokenize`
```python
    paragraphs = re.split(r'\n{2,}', body)
    for text in paragraphs:
        sentences = sent_tokenize(text)
```
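To see the two-stage split in action (assuming `punkt` has already been downloaded as in step 1):

```python
import re
from nltk.tokenize import sent_tokenize

demo = "NASA announced a discovery. It was big news.\n\nScientists were excited."
for para in re.split(r'\n{2,}', demo):
    print(sent_tokenize(para))
# ['NASA announced a discovery.', 'It was big news.']
# ['Scientists were excited.']
```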
- Run each sentence through the classifier (the model returns a label)
```python
        for sentence in sentences:
            try:
                sentence_result = fact_opinion_classifier(sentence)[0]
                if sentence_result["label"] == "LABEL_1":
                    fact_list.append(sentence)
```
🔵 `LABEL_1` means Fact
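Since the script classifies one sentence per call, throughput can suffer on long articles. Transformers pipelines also accept a list of strings plus a `batch_size` argument, so a batched variant of the inner loop is possible. A hedged sketch, shown only to illustrate the idea:

```python
        # Batched variant: one pipeline call per paragraph instead of per sentence.
        results = fact_opinion_classifier(sentences, batch_size=32)
        fact_list.extend(
            s for s, r in zip(sentences, results) if r["label"] == "LABEL_1"
        )
```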
📝 5. Save the results
```python
    d['fact_list'] = fact_list
    wr_dict(save_path, d)
```
- Attach the extracted fact sentences to the original record
- Write each record to the save path: `datasource/news_filter_fact.json`
✅ Worked example
One record from the input JSON file (`news_filter_dup.json`):
```json
{
    "title": "NASA discovers new exoplanet",
    "body": "NASA announced a new discovery today.\n\nThey found a planet that is 1.3 times the size of Earth. It orbits a star 300 light-years away.\n\nScientists say it might have liquid water."
}
```
How the model classifies the tokenized sentences:

- NASA announced a new discovery today. ⟶ Opinion
- They found a planet that is 1.3 times the size of Earth. ⟶ ✅ Fact
- It orbits a star 300 light-years away. ⟶ ✅ Fact
- Scientists say it might have liquid water. ⟶ Opinion
The corresponding record in the output JSON file (`news_filter_fact.json`):
```json
{
    "title": "NASA discovers new exoplanet",
    "body": "NASA announced a new discovery today.\n\nThey found a planet that is 1.3 times the size of Earth. It orbits a star 300 light-years away.\n\nScientists say it might have liquid water.",
    "fact_list": [
        "They found a planet that is 1.3 times the size of Earth.",
        "It orbits a star 300 light-years away."
    ]
}
```
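Once the script finishes, you can spot-check the output file with a few lines:

```python
import json

with open('datasource/news_filter_fact.json') as f:
    results = json.load(f)

print(results[0]['title'])
for s in results[0]['fact_list']:
    print('-', s)
```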
🧠 Summary
| Component | Role |
|---|---|
| `re` + `sent_tokenize` | Split the body into paragraphs, then into sentences |
| `pipeline(...xlmr-el)` | Classify each sentence as Fact or Opinion |
| `LABEL_1` | Marks a Fact; only these sentences are kept |
| `fact_list` | Collects all fact sentences for one news item |
| `wr_dict` | Writes each news item, with its facts, to the output file |