Data Mining in Practice: Building a Real-vs-Fake News Classification Model with Naive Bayes

Author homepage: @艾派森

About the author: a Python learner

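1. Project Background

2. Dataset

3. Tools

4. Experiment

4.1 Importing the Data

4.2 Data Preprocessing

4.3 Data Visualization

4.4 Feature Engineering

4.5 Building the Model

Source Code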

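1. Project Background

In the information age, the rapid development of internet technology has greatly increased the speed and reach of news media. That same speed and reach have also allowed fake news to proliferate. Fake news misleads the public and distorts public opinion, and it can seriously harm political stability, economic order, and public safety. Identifying and classifying real and fake news effectively has therefore become a pressing problem.

To address it, researchers have explored many methods and algorithms for building real-vs-fake news classifiers. Naive Bayes, a classification method grounded in probability and statistics, has drawn considerable attention for its solid mathematical foundation, stable classification performance, and simple structure. It computes the probability that a given text belongs to each class and picks the class with the highest probability as the prediction.

The strength of Naive Bayes for this task lies in its probabilistic modeling. It assumes that the features are conditionally independent of one another given the class, which greatly simplifies the computation. Although this assumption rarely holds exactly in practice, Naive Bayes still classifies well in many cases. It also has few parameters and is relatively insensitive to missing data, which makes it suitable for large datasets.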
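To make the independence assumption concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is not the implementation used later in this article (that is scikit-learn's `MultinomialNB`); the function names `train_nb` and `predict_nb` and the four toy headlines are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log likelihoods are simply summed
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy data (invented for illustration): 0 = fake, 1 = real
docs = [
    "shocking secret cure revealed".split(),
    "you will not believe this shocking trick".split(),
    "senate passes budget bill".split(),
    "minister announces new trade bill".split(),
]
labels = [0, 0, 1, 1]
log_prior, log_likelihood = train_nb(docs, labels)
prediction = predict_nb("shocking trick revealed".split(), log_prior, log_likelihood)
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term `alpha` keeps a word that never appeared in one class from forcing that class's probability to zero.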

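This study therefore builds a real-vs-fake news classification model on Naive Bayes, with the goal of improving the accuracy and efficiency of news classification. The model classifies news text automatically, which reduces the cost and error rate of manual review. It can also give news regulators technical support for detecting and handling fake news promptly, helping to protect the credibility of the news media. Finally, the study may serve as a reference for related research.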

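2. Dataset

The dataset used in this experiment comes from Kaggle. After merging, it contains 44,898 records and 5 variables.

About the dataset

The dataset is split into two files:

Fake.csv (23,481 fake news articles)
True.csv (21,417 real news articles)

Dataset columns:

title: the title of the news article
text: the body of the news article
subject: the subject of the news article
date: the publication date of the news article

The fifth variable, label, is added when the two files are merged (0 = fake news, 1 = real news).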

3. Tools

Python version: 3.9

Code editor: Jupyter Notebook

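4. Experiment

4.1 Importing the Data

Import the third-party data analysis libraries and load the two datasets, then add a label column to each and merge them.

Check the size of the data

Check the basic information of the data

View the descriptive statistics of the data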

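4.2 Data Preprocessing

Count the missing values

The original dataset contains no missing values.

Count the duplicate rows

The dataset contains 209 duplicate rows that need to be handled.

Remove the duplicates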
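The labeling, merging, and de-duplication steps can be tried on a small scale first. The sketch below uses two invented toy frames in place of Fake.csv and True.csv; only the pandas calls mirror the ones in this article.

```python
import pandas as pd

# Toy frames standing in for Fake.csv and True.csv (contents invented for illustration)
fake = pd.DataFrame({"title": ["a", "a", "b"], "text": ["x", "x", "y"]})
true = pd.DataFrame({"title": ["c", "d"], "text": ["z", "w"]})
fake["label"] = 0  # fake news
true["label"] = 1  # real news

# ignore_index=True gives the merged frame a fresh 0..n-1 index
df = pd.concat([fake, true], ignore_index=True)

# duplicated() flags rows identical to an earlier row in every column
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, df.shape)
```

Note that `duplicated()` compares all columns by default, so two articles with the same text but different dates would not count as duplicates.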

4.3 Data Visualization

Process the text variable:

Convert to lowercase

Remove stop words

Remove punctuation

Remove tags

Remove special characters
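The cleaning steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the article's exact function: the `clean` helper and the tiny `STOP_WORDS` set are assumptions made for the example, whereas the full source code uses NLTK's English stop-word list.

```python
import re
import string

# Tiny illustrative stop-word set; the article uses NLTK's full English list instead
STOP_WORDS = {"the", "a", "an", "is", "at", "of", "and", "to", "in"}

def clean(text):
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML-style tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"[^a-z\s]", " ", text)                # drop digits and other special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)

cleaned = clean("<b>BREAKING</b>: The Senate is voting at 9 tonight! See https://example.com")
print(cleaned)
```

The order matters: tags and URLs have to be removed before punctuation is stripped, otherwise characters such as `<`, `>`, and `/` are already gone and the tag and URL patterns no longer match.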

Now let's visualize the keywords of real and fake news with word clouds.

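4.4 Feature Engineering

First prepare the data for modeling, the independent variable X and the dependent variable y, then split the dataset into a training set and a test set.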
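As a small illustration of the split, the sketch below runs `train_test_split` on an invented toy corpus. The `stratify=y` argument is an addition that the article's own code does not use; it keeps the fake/real ratio the same in both subsets, which can help when the classes are not perfectly balanced.

```python
from sklearn.model_selection import train_test_split

# Toy corpus (invented): 10 documents, half fake (0) and half real (1)
X = [f"document {i}" for i in range(10)]
y = [0] * 5 + [1] * 5

# 80/20 split with a fixed seed; stratify=y preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test), sorted(y_test))
```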

4.5 Building the Model

We will try two different ways of fitting multinomial Naive Bayes:

1. Using CountVectorizer

2. Using TF-IDF

First, CountVectorizer.

CountVectorizer gives 96% accuracy. Now let's try TF-IDF.

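TF-IDF

The term frequency-inverse document frequency (TF-IDF) vectorizer is a popular technique in natural language processing (NLP) for converting a collection of raw documents into a numerical representation suitable for machine learning algorithms. It weights each term by how often it occurs in a document and down-weights terms that occur in many documents.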
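To show what the weighting does, here is a minimal from-scratch sketch on three invented headlines. It uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the variant scikit-learn documents for `TfidfVectorizer` with its default settings; the `tfidf` helper itself is hypothetical.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration)
docs = [
    "trump says tax bill".split(),
    "tax bill passes senate".split(),
    "aliens endorse trump".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc):
    """Raw term counts times IDF, then L2-normalized."""
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf(docs[0])
print(vector)
```

In the first headline, "says" appears in only one document while "tax" appears in two, so "says" ends up with the larger weight even though both occur once.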

The results show that CountVectorizer gives better accuracy (0.96) than TF-IDF (0.94).
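The comparison above rests on a single train/test split. One way to check whether a gap of this size is stable is to compare the two pipelines with cross-validation. The sketch below does this on an invented toy corpus, so its scores say nothing about the real dataset; with the article's data, `texts` and `labels` would be the cleaned `df['text']` and `df['label']`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (invented for illustration): 0 = fake, 1 = real
texts = [
    "shocking secret cure", "unbelievable shocking trick",
    "secret trick exposed", "shocking cure exposed",
    "senate passes bill", "minister signs trade bill",
    "senate debates trade", "minister passes budget",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

scores = {}
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    model = make_pipeline(vectorizer, MultinomialNB())
    # Mean accuracy over 2 stratified folds (use more folds on real data)
    scores[name] = cross_val_score(model, texts, labels, cv=2, scoring="accuracy").mean()
print(scores)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training portion of each fold, so no vocabulary or IDF statistics leak from the held-out fold.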

Source Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

fake = pd.read_csv("Fake.csv")
fake['label'] = 0  # fake news is labeled 0
true = pd.read_csv("True.csv")
true['label'] = 1  # real news is labeled 1
df = pd.concat([fake, true], axis=0, ignore_index=True)  # merge the two datasets
df.head()
df.shape
df.info()
df.describe(include='O')
df.isnull().sum()
df.duplicated().sum()
df.drop_duplicates(inplace=True)
df.duplicated().sum()
# plot the number of real and fake news articles
sns.countplot(x='label', data=df)
plt.show()
# plot the number of articles per subject
plt.figure(figsize=(12,6))
sns.countplot(x='subject', data=df)
plt.show()
# Text preprocessing:
#   convert to lowercase
#   remove stop words
#   remove punctuation
#   remove tags
#   remove special characters
# (Lemmatization is not applied in this code.)
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time download of the NLTK resources used below
# (recent NLTK releases may also require 'punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')

def clean_text(text):
    '''Lowercase the text and remove text in square brackets, links, tags,
    words containing digits, punctuation and other special characters.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub(r'<.*?>+', '', text)                  # tags (before '<' and '>' are stripped)
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)        # other special characters, including newlines
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    return text

# build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords_from_sentence(sentence):
    tokens = word_tokenize(sentence)
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    return ' '.join(filtered_tokens)

df['text'] = df['text'].apply(clean_text)
# apply the stop-word filter to the 'text' column
df['text'] = df['text'].apply(remove_stopwords_from_sentence)
df.head()
# Visualize the keywords of real and fake news with word clouds
from wordcloud import WordCloud
text = " ".join(i for i in df[df['label']==1]['text'])
wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('wordcloud for True News')
plt.show()
text = " ".join(i for i in df[df['label']==0]['text'])
wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('wordcloud for Fake News')
plt.show()
from sklearn.model_selection import train_test_split
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# We will try two different ways of fitting multinomial Naive Bayes:
# 1. Using CountVectorizer
# 2. Using TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix,classification_report,ConfusionMatrixDisplay,accuracy_score
# CountVectorizer
# create the pipeline
pipe = Pipeline([('cv', CountVectorizer()),
                ('nb', MultinomialNB())])

# fit the pipeline
pipe.fit(X_train, y_train)
# print the classification report and plot the confusion matrix
preds = pipe.predict(X_test)
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)
plt.show()
# CountVectorizer gives 96% accuracy. Now let's try TF-IDF.
#
# TF-IDF
# The term frequency-inverse document frequency (TF-IDF) vectorizer is a popular
# technique in natural language processing (NLP) for converting a collection of raw
# documents into a numerical representation suitable for machine learning algorithms.
# create the pipeline
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                ('nb', MultinomialNB())])

# fit the pipeline
pipe.fit(X_train, y_train)
# print the classification report and plot the confusion matrix
preds = pipe.predict(X_test)
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)
plt.show()
# predict on the training set (the pipeline is already fitted)
predict_train = pipe.predict(X_train)
# accuracy score on the training set
accuracy_train = accuracy_score(y_train,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)
# predict on the test set
predict_test = pipe.predict(X_test)
# accuracy score on the test set
accuracy_test = accuracy_score(y_test,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
# CountVectorizer gives better accuracy (0.96) than TF-IDF (0.94).
