Example code:
```python
import os
import re
import json
import nltk
from tqdm import tqdm
from transformers import pipeline

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

## Check If Fact or Opinion
# Model: lighteternal/fact-or-opinion-xlmr-el
fact_opinion_classifier = pipeline("text-classification", model="lighteternal/fact-or-opinion-xlmr-el")

def wr_dict(filename, dic):
    # Append one record to a JSON array on disk (read-modify-write).
    if not os.path.isfile(filename):
        data = [dic]
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)

def rm_file(file_path):
    # Remove the file if it exists, so each run starts clean.
    if os.path.exists(file_path):
        os.remove(file_path)

with open('datasource/news_filter_dup.json', 'r') as file:
    data = json.load(file)

save_path = 'datasource/news_filter_fact.json'
print(len(data))
print(data[0].keys())
rm_file(save_path)

for d in tqdm(data):
    fact_list = []
    body = d['body']
    # Split the body into paragraphs on blank lines, then into sentences.
    paragraphs = re.split(r'\n{2,}', body)
    for text in paragraphs:
        sentences = sent_tokenize(text)
        for sentence in sentences:
            try:
                sentence_result = fact_opinion_classifier(sentence)[0]
                # LABEL_1 == Fact
                if sentence_result["label"] == "LABEL_1":
                    fact_list.append(sentence)
            except Exception:
                # Log sentences the classifier fails on and keep going.
                print(sentence)
    d['fact_list'] = fact_list
    wr_dict(save_path, d)
```
💡 What the code does:
Split each news body into sentences, run every sentence through the
`lighteternal/fact-or-opinion-xlmr-el`
model to decide whether it is a Fact, keep the Fact sentences, and write the results to a new JSON file.
🔍 Step-by-step walkthrough
📦 1. Import dependencies & initialize the model
```python
import os, re, json, nltk
from tqdm import tqdm
from transformers import pipeline

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

fact_opinion_classifier = pipeline("text-classification", model="lighteternal/fact-or-opinion-xlmr-el")
```

- `nltk.download('punkt')`: downloads the English sentence tokenizer used by `sent_tokenize`
- `pipeline("text-classification", model=...)`: loads the Fact/Opinion classification model
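For a quick feel of the classifier's output: calling the pipeline on a single string returns a one-element list of `{label, score}` dicts, which is why the main loop indexes `[0]`. A minimal sketch (the printed score is illustrative, not from an actual run):

```python
result = fact_opinion_classifier("The Earth orbits the Sun.")
print(result)
# e.g. [{'label': 'LABEL_1', 'score': 0.99}]  -- score value illustrative
print(result[0]["label"])   # 'LABEL_1' means Fact for this model
```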
🧱 2. Helper functions
```python
def wr_dict(filename, dic):
    ...
```

- Appends each processed news item to the output JSON file, one record at a time

```python
def rm_file(file_path):
    ...
```

- Deletes the output file if it already exists, so stale results from a previous run don't get mixed in
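One caveat: `wr_dict` re-reads and rewrites the entire JSON array on every call, so saving N items costs O(N²) work overall. If that becomes a bottleneck, a common alternative is JSON Lines (one object per line, appended). A minimal sketch; `jsonl_append` is a hypothetical replacement, not part of the original script:

```python
import json

def jsonl_append(filename, dic):
    # Append one JSON object per line; no re-reading needed.
    with open(filename, 'a') as f:
        f.write(json.dumps(dic) + '\n')

# Read everything back later:
# data = [json.loads(line) for line in open(filename)]
```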
📂 3. Load the input data (news bodies)
```python
with open('datasource/news_filter_dup.json', 'r') as file:
    data = json.load(file)
```

- Loads the deduplicated news file
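A quick sanity check on the loaded records can catch schema surprises early (a hypothetical addition, assuming every record carries a `body` field as in the example further below):

```python
# Hypothetical check, not in the original script.
assert all('body' in d for d in data), "every record needs a 'body' field"
print(len(data), "records; first record keys:", sorted(data[0].keys()))
```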
🧹 4. Main processing loop (extract the fact sentences)
```python
for d in tqdm(data):      # iterate over every news item
    fact_list = []
    body = d['body']      # the news body text
```
- First split the body into paragraphs on blank lines (`\n{2,}`)
- Then split each paragraph into sentences with `sent_tokenize`
```python
    paragraphs = re.split(r'\n{2,}', body)
    for text in paragraphs:
        sentences = sent_tokenize(text)
```
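To see the two-stage split in action (assuming `punkt` has already been downloaded as in step 1):

```python
import re
from nltk.tokenize import sent_tokenize

demo = "NASA announced a discovery. It was big news.\n\nScientists were excited."
for para in re.split(r'\n{2,}', demo):
    print(sent_tokenize(para))
# ['NASA announced a discovery.', 'It was big news.']
# ['Scientists were excited.']
```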
- Run each sentence through the classifier (the model returns a label)
```python
        for sentence in sentences:
            try:
                sentence_result = fact_opinion_classifier(sentence)[0]
                if sentence_result["label"] == "LABEL_1":
                    fact_list.append(sentence)
```
🔵 `LABEL_1` means Fact
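Since the script classifies one sentence per call, throughput can suffer on long articles. Transformers pipelines also accept a list of strings plus a `batch_size` argument, so a batched variant of the inner loop is possible. A hedged sketch, shown only to illustrate the idea:

```python
        # Batched variant: one pipeline call per paragraph instead of per sentence.
        results = fact_opinion_classifier(sentences, batch_size=32)
        fact_list.extend(
            s for s, r in zip(sentences, results) if r["label"] == "LABEL_1"
        )
```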
📝 5. Save the results
```python
    d['fact_list'] = fact_list
    wr_dict(save_path, d)
```
- Attach the extracted fact sentences to the original record
- Write each record to the save path: `datasource/news_filter_fact.json`
✅ Worked example
One record from the input JSON file (`news_filter_dup.json`):
```json
{
    "title": "NASA discovers new exoplanet",
    "body": "NASA announced a new discovery today.\n\nThey found a planet that is 1.3 times the size of Earth. It orbits a star 300 light-years away.\n\nScientists say it might have liquid water."
}
```
How the model classifies the tokenized sentences:

- NASA announced a new discovery today. ⟶ Opinion
- They found a planet that is 1.3 times the size of Earth. ⟶ ✅ Fact
- It orbits a star 300 light-years away. ⟶ ✅ Fact
- Scientists say it might have liquid water. ⟶ Opinion
The corresponding record in the output JSON file (`news_filter_fact.json`):
```json
{
    "title": "NASA discovers new exoplanet",
    "body": "NASA announced a new discovery today.\n\nThey found a planet that is 1.3 times the size of Earth. It orbits a star 300 light-years away.\n\nScientists say it might have liquid water.",
    "fact_list": [
        "They found a planet that is 1.3 times the size of Earth.",
        "It orbits a star 300 light-years away."
    ]
}
```
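Once the script finishes, you can spot-check the output file with a few lines:

```python
import json

with open('datasource/news_filter_fact.json') as f:
    results = json.load(f)

print(results[0]['title'])
for s in results[0]['fact_list']:
    print('-', s)
```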
🧠 Summary
| Component | Role |
|---|---|
| `re` + `sent_tokenize` | Split the body into paragraphs, then into sentences |
| `pipeline(...xlmr-el)` | Classify each sentence as Fact or Opinion |
| `LABEL_1` | Marks a Fact; only these sentences are kept |
| `fact_list` | Collects all fact sentences for one news item |
| `wr_dict` | Writes each news item, with its facts, to the output file |