Python Data Analysis: Text Classification and Sentiment Prediction with Multiple Machine Learning Methods

Hi everyone, I'm 带我去滑雪!

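Text classification is a machine learning and natural language processing (NLP) task that assigns a given piece of text to one of a set of predefined categories or labels. The goal is to classify and label text automatically so that it can be organized, sorted, and analyzed by content or topic. It is widely used in sentiment analysis, spam filtering, news categorization, recommender systems, and similar applications.
The key steps in text classification are: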

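Data preparation: prepare the text for the training and test sets, with every sample already labeled with its class.
Feature extraction: extract useful features that represent the text. Common methods include the bag-of-words model, TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings.
Model training: train a classifier on the labeled training data. Common machine learning algorithms include naive Bayes, support vector machines (SVM), decision trees, and random forests. More recently, deep learning methods such as convolutional neural networks (CNN) and recurrent neural networks (RNN) have also been widely applied to text classification.
Model evaluation: evaluate the trained model on the held-out test data and compute metrics such as accuracy, precision, and recall.
Prediction and application: use the trained model to classify new, unlabeled text.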
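As a rough illustration of how these five steps fit together, here is a minimal sketch using scikit-learn. The toy English sentences, the labels, and the variable names are made up for illustration and are not the Tieba data used later; `MultinomialNB` is just one of the algorithms listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Step 1: labeled data (hypothetical toy samples; 1 = positive, 0 = negative)
texts = [
    "great course really helpful", "love this campus",
    "wonderful teacher great class", "happy with the dorm",
    "terrible food bad service", "hate the long queue",
    "bad dorm terrible noise", "awful schedule really bad",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the samples for testing, keeping the class balance
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Step 2: feature extraction (the vectorizer is fitted on the training text only)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Step 3: train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Step 4: evaluate on the held-out samples
accuracy = clf.score(X_test, y_test)

# Step 5: predict the class of new, unlabeled text
new_pred = clf.predict(vectorizer.transform(["really great teacher"]))
print(accuracy, new_pred)
```

With so few samples the accuracy itself means nothing; the point is only the order of the steps. Fitting the vectorizer on the training split alone keeps the test text out of the learned vocabulary and IDF weights.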

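This post first uses Python to scrape comments from Baidu Tieba to obtain text data. It then preprocesses the text (Chinese word segmentation, data cleaning, feature extraction, TF-IDF weighting), does some data analysis and visualization, and finally classifies the text with eight machine learning methods: naive Bayes, a neural network, a support vector machine, random forest, logistic regression, k-nearest neighbors, a decision tree, and gradient boosting.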

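1. Scraping Baidu Tieba comments to obtain text data

(1) Code

(2) Sample of the data

2. Data preprocessing

(1) Chinese word segmentation

(2) Sentiment scoring

(3) Converting the text into vectors

(4) Computing TF-IDF weights

3. Data analysis and visualization

(1) Counting scores per interval

(2) Visualizing the score intervals

(3) Drawing a word cloud

(4) Top 10 keywords

(5) Counting positive and negative comments and visualizing the result

4. Classifying the text with eight machine learning methods

(1) Random split with 20% of the samples held out: test set (784 samples) and training set (3,135 samples)

(2) Fitting the models and comparing test-set accuracy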

1. Scraping Baidu Tieba comments to obtain text data

(1) Code

import requests
from bs4 import BeautifulSoup

def get_html(url):
    """Fetch a page and return its HTML text, or "ERROR" if the request fails."""
    try:
        kv = {'user-agent': 'Mozilla/4.0'}  # pose as a browser client
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = 'UTF-8'
        return r.text
    except requests.RequestException:
        return "ERROR"

def get_content(url):
    """Parse one thread-list page and return a list of dicts, one per thread."""
    comments = []
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    liTags = soup.find_all('li', attrs={'class': 'j_thread_list clearfix thread_item_box'})
    for li in liTags:
        comment = {}
        try:
            comment['title'] = li.find('a', attrs={'class': 'j_th_tit'}).text.strip()
            comment['link'] = "http://tieba.baidu.com" + li.find('a', attrs={'class': 'j_th_tit'})['href']
            comment['name'] = li.find('span', attrs={'class': 'tb_icon_author'}).text.strip()
            comment['time'] = li.find('span', attrs={'class': 'pull-right is_show_create_time'}).text.strip()
            comment['replyNum'] = li.find('span', attrs={'class': 'threadlist_rep_num center_text'}).text.strip()
            comments.append(comment)
        except (AttributeError, TypeError, KeyError):
            # an expected tag or attribute is missing from this entry; skip it
            print('出了点小问题')  # "Ran into a small problem"
    return comments

def Out2File(comments):
    """Append the scraped records to a text file, one thread per line."""
    with open('大学吧的评论最新.txt', 'a+', encoding='utf-8') as f:
        for comment in comments:
            f.write('标题： {} \t 链接：{} \t 发帖人：{} \t 发帖时间：{} \t 回复数量： {} \n'.format(
                comment['title'], comment['link'], comment['name'], comment['time'], comment['replyNum']))
        print('当前页面爬取完成')  # "Finished scraping the current page"

def main(base_url, deep):
    url_list = []  # URLs of the list pages to scrape
    for i in range(0, deep):
        url_list.append(base_url + '&pn=' + str(50 * i))
    print('所有的网页已经下载到了本地，开始筛选信息。。。。')
    # fetch each page in turn and write its records to the file
    for url in url_list:
        content = get_content(url)
        Out2File(content)
    print('所有信息都已经保存完毕！')  # "All information has been saved!"

base_url = 'http://tieba.baidu.com/f?kw=大学&ie=utf-8&'
deep = 200

if __name__ == '__main__':
    main(base_url, deep)

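Output (the first line reads "All pages have been downloaded locally, starting to filter the information", and each following line reads "Finished scraping the current page"):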

所有的网页已经下载到了本地，开始筛选信息。。。。
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成
当前页面爬取完成

$\vdots$

(2) Sample of the data

1	好好画画啦
2	求各专业大佬
3	欢迎报考北邮
4	话费充值需要dd
5	兼职有没有来的
6	在校大学生一枚
7	滴滴，喜欢的看过来
8	大学生进！！！
9	有什么快速挣钱的好方法？
10	大学，要挣米，来，???带一手
11	大学宿舍限电是普遍现象吗，一般限多少瓦
12	你们认为大学生打工，什么工作最好
13	家人们该不该
14	兼职介绍，有没有
15	稳稳的一天
16	创建一个资源共享群，亲们留下你们的微信，我拉你们进群
17	假期的小工作
18	寻说明书系统说明，撰写选手
19	加QQ！！！..
20	有兼职群吗

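2. Data preprocessing

(1) Chinese word segmentation

After scraping the Chinese comments with Python, the first step is Chinese word segmentation. English words are separated by spaces, so text can be split into words directly and no segmentation step is needed. Chinese characters, by contrast, run together with no visible delimiter between words, even though the words are what carry the meaning. A segmentation tool is therefore needed to turn each sentence in the corpus into a sequence of words separated by spaces. Here I use the Jieba Chinese word segmentation tool.

Not every word in the segmented comments is relevant to the content. There are usually auxiliary words that carry very little meaning, such as the Chinese "我们" (we), "的" (a possessive particle), and "可以" (can), or the English "a" and "the". In natural language processing and data mining these are called stop words, and they need to be filtered out, usually with a stop word list or dictionary. The stop word list used here is available at the end of this post.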
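Stop word filtering on its own can be sketched in a few lines. The tokens below are written by hand rather than produced by a segmenter, and the stop word set is a hypothetical three-word sample rather than the list used in this post.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical stop word set; a real list would be loaded from a file
stop_words = {"我们", "的", "可以"}

# Tokens as a segmenter might produce them (hand-written here)
tokens = ["我们", "的", "大学", "可以", "申请", "奖学金"]

filtered = remove_stop_words(tokens, stop_words)
print(" ".join(filtered))  # 大学 申请 奖学金
```

Holding the stop words in a `set` keeps each membership test cheap, which matters when the list has thousands of entries.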

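import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import jieba

plt.rcParams['font.sans-serif'] = ['KaiTi']  # default font (KaiTi here; SimHei is another common choice)
plt.rcParams['axes.unicode_minus'] = False   # display the minus sign correctly in figures

# load the stop word list, one word per line
stop_list = pd.read_csv("停用词.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
stop_words = set(stop_list['stopword'])  # a set makes the membership test fast

# Jieba segmentation: cut a sentence, drop the stop words, join the rest with spaces
def txt_cut(juzi):
    lis = [w for w in jieba.lcut(juzi) if w not in stop_words]
    return " ".join(lis)

# "ANSI" is a Windows-only codec name; on other systems pass the file's actual encoding
df = pd.read_csv('E:/工作/硕士/data.csv', encoding="ANSI")
df['cutword'] = df['PL'].astype('str').apply(txt_cut)
df = df[['PL', 'cutword']]
df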

Output:

(2) Sentiment scoring

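import pandas as pd
from snownlp import SnowNLP

data1 = pd.read_csv('E:/工作/硕士/博客//data.csv', encoding="ANSI")
# SnowNLP returns a sentiment score between 0 and 1; values closer to 1 are more positive
data1['emotion'] = data1['PL'].apply(lambda x: SnowNLP(x).sentiments)
data1.to_excel('评论情感打分值.xlsx', index=False)  # save the scores; see the spreadsheet for the full data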

Sample results:

No.	PL	emotion
1	周末了，同学们，哪里可以下单机游戏玩？	0.936292288
2	大学生等一个	0.753614865
3	每位同学顺利毕业??是我最大的心愿??	0.994685253
4	了解过咸鱼吧	0.5
5	学习通看完了吗，没看完找我呀	0.761398932
6	特凉列肮删	0.066063183
7	有没有安徽的	0.77536362
8	寄快递也可以赚钱，你们满意嘛？	0.112413565
9	618可以组队的羊毛群，（超红口令很早就知道）	0.100694453
10	大家晚上好	0.529099344
11	同学们大家好	0.87008235
12	放纵的代价	0.738870215
13	大学生第一次上台演讲ppt	0.853786676
14	有没有正常的	0.353878603
15	纯绿色搬砖，多劳多得	0.154165573

(3) Converting the text into vectors

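from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# take X and y; this assumes df already has a 'text' column (the segmented text)
# and a 'label' column (the class), which the earlier steps do not create
X = df['text']
y = df['label']

# create a TfidfVectorizer instance
# (its default token_pattern keeps only tokens of two or more characters)
vectorizer = TfidfVectorizer()

# turn the text into TF-IDF vectors
X = vectorizer.fit_transform(X)

# check the shape of the feature matrix
X.shape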
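The snippet above reads `df['text']` and `df['label']`, while the earlier steps only produce the `cutword` column and the `emotion` score. Below is a sketch of one way the two columns could be derived. It assumes, which the post does not state, that `text` is the segmented comment and that `label` is 1 when the sentiment score is at least 0.5, the same threshold used later to count positive and negative comments. The three rows are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical rows standing in for the real segmented comments and scores
df = pd.DataFrame({
    "cutword": ["同学 顺利 毕业", "寄 快递 赚钱", "大家 晚上 好"],
    "emotion": [0.99, 0.11, 0.53],
})

# Assumption: the text to vectorize is the space-separated segmented comment
df["text"] = df["cutword"]

# Assumption: a score of 0.5 or more counts as positive (1), otherwise negative (0)
df["label"] = (df["emotion"] >= 0.5).astype(int)

print(df[["text", "label"]])
```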

(4) Computing TF-IDF weights

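# total TF-IDF weight of each term, summed over all documents
# (stored as tfidf_data so that the data1 DataFrame holding the sentiment scores is not overwritten)
tfidf_data = {'word': vectorizer.get_feature_names_out(),
              'tfidf': X.toarray().sum(axis=0).tolist()}
df1 = pd.DataFrame(tfidf_data).sort_values(by="tfidf", ascending=False, ignore_index=True)
df1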
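To make these weights less opaque: with its default settings, `TfidfVectorizer` uses the raw term count as tf, the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each document vector to unit L2 length. The sketch below re-implements that by hand on a hypothetical three-document corpus and sums each column as the code above does. The function name and the corpus are illustrative only. It takes token lists directly, so unlike the vectorizer's default token pattern it keeps single-character tokens.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """docs: list of token lists. Returns (vocabulary, one weight dict per document)."""
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # document frequency: number of documents that contain each term
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize the document vector (guarding against an empty document)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

# Hypothetical pre-segmented corpus
docs = [["大学", "兼职"], ["大学", "宿舍"], ["大学", "兼职", "群"]]
vocab, rows = tfidf_rows(docs)

# Sum each term's weight over all documents, then rank the terms
totals = {t: sum(row.get(t, 0.0) for row in rows) for t in vocab}
for term, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(total, 3))
```

Within a single document the rarer term gets the larger weight, but a term that appears in every document can still come out on top once the weights are summed over the corpus.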

Sample results:

3. Data analysis and visualization

(1) Counting scores per interval

from itertools import groupby

score_list = data1['emotion']
step = 0.1

# group the sorted scores by interval index and print the count in each interval
for k, g in groupby(sorted(score_list), key=lambda x: x // step):
    print('{:.1f}-{:.1f}: {}'.format(k * step, (k + 1) * step, len(list(g))))
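The same counting can be written without sorting by using a `Counter`. This is only a sketch of the idea with the same floor-division rule as the loop above; the function name and the six scores are made up.

```python
from collections import Counter

def count_by_interval(scores, step=0.1):
    """Count scores per interval of width `step`, keyed by a 'low-high' label."""
    bins = Counter(int(s // step) for s in scores)
    return {f"{k * step:.1f}-{(k + 1) * step:.1f}": bins[k] for k in sorted(bins)}

# Hypothetical sentiment scores
scores = [0.05, 0.12, 0.18, 0.55, 0.95, 0.99]
counts = count_by_interval(scores)
for label, n in counts.items():
    print(label, n)
```

One thing to keep in mind with either version: floor division on floats can put a score that sits exactly on a boundary into the lower interval. For example, `0.3 // 0.1` evaluates to `2.0` in Python, not `3.0`.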

Output:

(2) Visualizing the score intervals

%matplotlib inline
# (the line above is a Jupyter magic; remove it when running as a plain script)
from matplotlib import pyplot as plt
from matplotlib import font_manager

# score intervals and the number of comments in each
a = ["0-0.1", "0.1-0.2","0.2-0.3","0.3-0.4","0.4-0.5","0.5-0.6","0.6-0.7","0.7-0.8","0.8-0.9","0.9-1"]
b = [265,248,259,329,348,319,329,375,439,1064]

plt.figure(figsize=(20,8),dpi=300)
# a font with Chinese glyphs for the labels (Windows path)
my_font=font_manager.FontProperties(fname=r"C:\Windows\Fonts\STSONG.TTF",size=12)
rects=plt.bar(a,b,width=0.3,color=['red','green','blue','cyan','yellow','gray'])
plt.xticks(a,fontproperties=my_font,rotation=45)
plt.xlabel("情感得分区间",fontproperties=my_font,fontsize=20)  # x label: "sentiment score interval"
plt.ylabel("数量",fontproperties=my_font,fontsize=20)  # y label: "count"

# write each bar's count just above the bar
for rect in rects:
    y=rect.get_height()
    x=rect.get_x()+rect.get_width()/2
    plt.text(x,y+0.5,str(y),ha="center",fontsize=15)

plt.title("情感分区间条形图",fontproperties=my_font,fontsize=20)  # title: "bar chart of sentiment score intervals"
plt.savefig("squares2.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            facecolor ="w",
            edgecolor ='w',
            dpi=300,
            orientation ='landscape')

Output:

(3) Drawing a word cloud

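from wordcloud import WordCloud
import jieba

# font_path must point to a font with Chinese glyphs, otherwise the words will not render
w = WordCloud(font_path="msyh.ttc",background_color="white",max_words=500,width=1000,height=600)

# concatenate all comments into one string
text = ''
for s in data1['PL']:
    text += s

# segment the text and build the word cloud from the space-separated words
data_cut = ' '.join(jieba.lcut(text))
w.generate(data_cut)
image = w.to_file('词云图.png')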

Output:

(4) Top 10 keywords

from jieba import analyse

key_words = jieba.analyse.extract_tags(sentence=text, topK=10, withWeight=True, allowPOS=())

key_words

Output:

(5) Counting positive and negative comments and visualizing the result

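# count the positive and negative comments (a score of 0.5 or more counts as positive)
pos = 0
neg = 0
for i in data1['emotion']:
    if i >= 0.5:
        pos += 1
    else:
        neg += 1

print('积极评论，消极评论数目分别为：')  # "Numbers of positive and negative comments:"
pos,neg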

Output:

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# pie chart of the share of positive and negative comments
pie_labels='positive','negative'
plt.pie([pos,neg],labels=pie_labels,autopct='%1.1f%%',shadow=True)
plt.savefig("squares3.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            facecolor ="w",
            edgecolor ='w',
            dpi=300,
            orientation ='landscape')

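Output:

4. Classifying the text with eight machine learning methods

(1) Random split with 20% of the samples held out: test set (784 samples) and training set (3,135 samples)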

# stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# check the shapes after the split
X_train.shape, X_test.shape, y_train.shape, y_test.shape
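The two sizes in the heading follow from the sample count. The sketch below assumes the 3,919 samples implied by 784 + 3,135, and that a fractional `test_size` is rounded up to a whole number of samples, which is how scikit-learn handles it as far as I know.

```python
import math

n_samples = 784 + 3135   # total implied by the heading: 3919
test_size = 0.2

# a fractional test size is rounded up to a whole number of samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)  # 784 3135
```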

(2) Fitting the models and comparing test-set accuracy

from sklearn.linear_model import LogisticRegression

from sklearn.naive_bayes import MultinomialNB

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import GradientBoostingClassifier

from sklearn.svm import SVC

from sklearn.neural_network import MLPClassifier


model1 =  LogisticRegression(C=1e10,max_iter=10000)
model2 = MultinomialNB()
model3 = KNeighborsClassifier(n_neighbors=50)
model4 = DecisionTreeClassifier(random_state=77)
model5= RandomForestClassifier(n_estimators=500,  max_features='sqrt',random_state=10)
model6 = GradientBoostingClassifier(random_state=123)
model9 = SVC(kernel="rbf", random_state=77)
model10 = MLPClassifier(hidden_layer_sizes=(16,8), random_state=77, max_iter=10000)
model_list=[model1,model2,model3,model4,model5,model6,model9,model10]
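# display names, in order: logistic regression, naive Bayes, k-nearest neighbors, decision tree,
# random forest, gradient boosting, support vector machine, neural network
model_name=['逻辑回归','朴素贝叶斯','K近邻','决策树','随机森林','梯度提升','支持向量机','神经网络']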

scores=[]
for i in range(len(model_list)):
    model_C=model_list[i]
    name=model_name[i]
    model_C.fit(X_train, y_train)
    s=model_C.score(X_test, y_test)
    scores.append(s)
    print(f'{name}方法在测试集的准确率为{round(s,3)}')

plt.figure(figsize=(7,3),dpi=128)
sns.barplot(y=model_name,x=scores,orient="h")
plt.xlabel('模型准确率')
plt.ylabel('模型名称')
plt.xticks(fontsize=10,rotation=45)
plt.title("不同模型文本分类准确率对比")
plt.savefig("squares4.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            facecolor ="w",
            edgecolor ='w',
            dpi=300,
            orientation ='landscape')

Output:
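The comparison above scores each model by accuracy alone, while the introduction also lists precision and recall as evaluation metrics. A small sketch of how the three relate for a two-class problem is given below. The label vectors are made up, and the function is illustrative rather than part of the workflow in this post.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.75 0.8 0.8
```

Since the score histogram earlier shows far more comments in the 0.9-1 interval than in any other, the classes are probably unbalanced, and precision and recall on the negative class may say more about a model than accuracy does.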