Hands-On Project: A Movie Review Sentiment Analysis System

2026/2/14 20:14:43

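1. Introduction

2. Data Acquisition and Preprocessing

3. Building the Text Classification Model (Using LSTM)

4. Evaluation and Model Optimization

4.1. Evaluation

4.2. Model Optimization

5. Summary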

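1. Introduction

This article walks through a complete hands-on project: building a movie review sentiment analysis system in Python. It covers the whole workflow, from data acquisition and preprocessing, through building a text classification model with a long short-term memory (LSTM) network, to evaluating the results and optimizing the model. Detailed steps and code examples follow.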

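2. Data Acquisition and Preprocessing

First, we need a labeled movie review dataset that contains the text of each review and its sentiment label (positive or negative).

We assume here that a CSV file, movie_reviews.csv, is already available with two columns: text (the review content) and sentiment (the label, where 0 is negative and 1 is positive).

A simplified sample of movie_reviews.csv looks like this: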

text,sentiment
"This movie was terrible. I didn't enjoy it at all.",0
"I loved this movie! It was so exciting and engaging.",1
"The acting was bad and the plot was boring.",0
"The storyline was great and the characters were well-developed.",1
"I wouldn't recommend this movie to anyone.",0
"This was one of the best movies I've seen this year.",1
"The special effects were amazing, but the story was weak.",0
"The movie kept me on the edge of my seat the whole time.",1
"I found the movie to be quite disappointing.",0
"The acting and cinematography were both top-notch.",1
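Before any modeling, it can help to confirm that the file really has the two expected columns and only 0/1 labels. The following is a minimal sketch using only the standard library. The helper name load_reviews and the inline SAMPLE_CSV string are illustrative assumptions, not part of the original project, which reads the file with pandas.

```python
import csv
import io
from collections import Counter

# Three rows copied from the sample above; a real run would open
# movie_reviews.csv instead of using this inline string.
SAMPLE_CSV = '''text,sentiment
"This movie was terrible. I didn't enjoy it at all.",0
"I loved this movie! It was so exciting and engaging.",1
"The acting was bad and the plot was boring.",0
'''

def load_reviews(handle):
    """Return (text, label) pairs, rejecting empty texts and labels other than 0 or 1."""
    reviews = []
    for row in csv.DictReader(handle):
        text = row["text"].strip()
        label = int(row["sentiment"].strip())  # tolerate stray whitespace around the label
        if not text or label not in (0, 1):
            raise ValueError(f"bad row: {row!r}")
        reviews.append((text, label))
    return reviews

reviews = load_reviews(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in reviews)
print(len(reviews), dict(label_counts))
```

Checking the label counts early also shows whether the two classes are roughly balanced, which affects how much plain accuracy can be trusted later.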

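Next, preprocess the data:

Text cleaning: remove punctuation, special characters, stop words, and so on.
Tokenization: split sentences into words or phrases.
Vectorization: convert the text into numeric vectors that the model can process.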
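The cleaning and tokenization steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the STOP_WORDS set is a deliberately tiny, hypothetical list, and the function name clean_and_tokenize is not from the original article. A real project would more likely take its stop words from a library such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "was", "it", "at", "i", "this", "to", "and"}

def clean_and_tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, replace non-letters with spaces, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits become spaces
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]

tokens = clean_and_tokenize("This movie was terrible. I didn't enjoy it at all.")
print(tokens)
```

Note that stripping the apostrophe splits "didn't" into "didn" and "t", so the negation is partly lost. For sentiment analysis this may matter, and it is one reason to treat aggressive cleaning and stop-word removal as tunable choices rather than fixed steps.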

For more text preprocessing methods, see the CSDN article "自然语言处理（NLP）基础：文本预处理与词嵌入" (NLP Basics: Text Preprocessing and Word Embeddings): https://blog.csdn.net/meijinbo/article/details/137022427

The Python code for data preprocessing is as follows:

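import re

import pandas as pd
# Legacy Keras 2.x preprocessing API; newer Keras releases may no longer provide it
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('movie_reviews.csv')

# Text cleaning and tokenization
def preprocess_text(text):
    # More cleaning steps (for example stop-word removal) can be added here
    text = text.lower()  # convert to lowercase
    # Replace non-letters with spaces; str.replace() would treat the pattern as a literal string
    text = re.sub(r"[^a-z]", " ", text)
    words = text.split()  # split on whitespace
    return words

# Preprocess every review
processed_texts = data['text'].apply(preprocess_text)

# Create the Tokenizer and fit it on the data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(processed_texts)

# Convert the texts to integer sequences
sequences = tokenizer.texts_to_sequences(processed_texts)

# Find the maximum sequence length and pad every sequence to it
max_seq_len = max(len(seq) for seq in sequences)
X = pad_sequences(sequences, maxlen=max_seq_len)

# Keep the labels as a 0/1 vector: the model ends in a single sigmoid unit,
# so one-hot labels of shape (n, 2) would not match its output of shape (n, 1)
y = data['sentiment'].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)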
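To make the Tokenizer and pad_sequences steps more concrete, here is a rough pure-Python approximation of what they do: words are mapped to integer indices starting at 1, with more frequent words getting smaller indices, and sequences are left-padded with 0. The helper names build_vocab and pad_left are hypothetical, and the alphabetical tie-breaking is a choice made here for determinism, not a claim about Keras internals.

```python
from collections import Counter

def build_vocab(tokenized_texts):
    """Map each word to an integer index, most frequent first; 0 stays reserved for padding."""
    counts = Counter(w for tokens in tokenized_texts for w in tokens)
    # Sort by descending frequency; ties are broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}

def pad_left(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if the sequence is too long."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

toy_texts = [["good", "movie"], ["bad", "movie", "bad", "plot"]]
vocab = build_vocab(toy_texts)
toy_sequences = [[vocab[w] for w in tokens] for tokens in toy_texts]
padded = [pad_left(seq, maxlen=4) for seq in toy_sequences]
print(vocab)
print(padded)
```

Because index 0 never refers to a real word, the embedding layer needs one more row than the vocabulary has words, which is why the model below uses len(tokenizer.word_index) + 1 as its input dimension.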

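3. Building the Text Classification Model (Using LSTM)

We now use Keras to build a simple LSTM model for sentiment analysis. The model consists of an embedding layer (Embedding), an LSTM layer, and a fully connected output layer.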

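from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Define the model architecture
model = Sequential()
# +1 because index 0 is reserved for padding
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64, input_length=max_seq_len))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))  # single unit: probability of the positive class

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model and keep the returned History object for plotting later
# (note that the test split doubles as validation data here)
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))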
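As a sanity check on the architecture, the number of trainable parameters in each layer can be worked out by hand using the standard formulas for embedding, LSTM, and dense layers. The sketch below does this for the layer sizes used above (64-dimensional embeddings, 128 LSTM units, one output unit). The vocabulary size of 5000 is a hypothetical value chosen for illustration; in the project it would be len(tokenizer.word_index) + 1.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary index."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units

vocab_size = 5000  # hypothetical; depends on the actual dataset
counts = {
    "embedding": embedding_params(vocab_size, 64),
    "lstm": lstm_params(64, 128),
    "dense": dense_params(128, 1),
}
print(counts)
print("total:", sum(counts.values()))
```

These figures should match what model.summary() reports for the same layer sizes. With a dataset as small as the ten-row sample, a model of this size would almost certainly overfit, so a larger corpus is needed for meaningful results.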

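4. Evaluation and Model Optimization

4.1. Evaluation

We assess the model's generalization ability by examining the loss and accuracy curves recorded during training, together with its performance metrics on the test set.

Python code: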

import matplotlib.pyplot as plt

def plot_history(history):
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()

    plt.show()

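# `history` is the object returned by model.fit(...)
plot_history(history)  # show the accuracy and loss curves from training

# Performance on the test set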
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss:.4f}')
print(f'Test Accuracy: {test_acc:.4f}')
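Accuracy alone can hide how the model behaves on each class, so precision, recall, and F1 are often reported alongside it. The following is a minimal pure-Python sketch of these metrics for binary labels. The function name binary_metrics and the toy probabilities are illustrative assumptions; in the project, the probabilities would presumably come from something like model.predict(X_test).ravel().

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 from true labels and predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels and probabilities, made up purely for illustration
toy_true = [1, 0, 1, 1, 0, 0]
toy_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = binary_metrics(toy_true, toy_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The 0.5 threshold is the usual default for a sigmoid output, but it can be moved to trade precision against recall.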