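Spam Detection: TF-IDF Analysis, Cluster Analysis, and Naive Bayes

Data source: Machine-Learning-Based Spam Message Classification (基于机器学习的垃圾信息识别分类) - Heywhale.com

This dataset is designed for classifying spam in emails and text messages, and is suitable for building spam detection models.

Data description

Field	Description
message_content	Body text of the email or text message
is_spam	Label indicating whether the message is spam (1 = spam, 0 = not spam)

This article covers two things: an analysis of the patterns found in spam and non-spam messages, and the training and testing of a spam filter with a machine learning model.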

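Part 1: Spam Pattern Analysis

To analyze common patterns in spam, such as promotional offers and phishing attempts, we select only the messages labeled 1 and run a text analysis on their content. First, we use TF-IDF to identify the keywords in spam, which shows which words carry the most weight there. Basic text processing, such as tokenization and stop-word removal, is handled by the vectorizer. Finally, we summarize the common patterns found in spam.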

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

file_path = 'spam_dataset.csv'
data = pd.read_csv(file_path)

spam_messages = data[data['is_spam'] == 1]['message_content']

vectorizer = TfidfVectorizer(max_features=20, stop_words='english')
X = vectorizer.fit_transform(spam_messages)

feature_names = vectorizer.get_feature_names_out()

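# X.sum(axis=0) returns a 1 x n matrix; flatten it to a 1-D array before sorting
tfidf_totals = np.asarray(X.sum(axis=0)).ravel()
# Feature indices ordered from highest to lowest total TF-IDF weight
top_words = np.argsort(tfidf_totals)[::-1]
top_words_in_spam = [feature_names[i] for i in top_words]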

top_words_in_spam

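The TF-IDF analysis shows that the most prominent words in spam include: "avoid", "verify", "fast", "win", "free", "time", "special", "limited", "account", "claim", "offer", "don't", "miss", "act", "click", "directly", "visit", "website", "contact", and "details".

These words help us recognize some common spam patterns, for example:

Promotional offers: words such as "free", "special", "limited", "offer", "act now", and "time" are typically associated with promotions.
Phishing: words such as "verify", "account", "claim", and "details" may indicate a phishing message that tries to obtain personal information.
Urgency: words such as "fast", "don't miss", and "directly" are typically used to create a sense of urgency and push the recipient to act quickly.

To subdivide spam further, we can use an unsupervised learning technique such as cluster analysis. The TF-IDF vectors serve as the clustering features. We pick a clustering algorithm such as K-means, choose a number of clusters, and then interpret the result to identify different types of spam.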

from sklearn.cluster import KMeans

num_clusters = 3
# Convert the sparse TF-IDF matrix to a dense NumPy array
X_spam_array = X.toarray()

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(X_spam_array)

cluster_labels = kmeans.labels_

# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
data_spam = data[data['is_spam'] == 1].copy()
data_spam['cluster_label'] = cluster_labels

data_spam.head()
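
The number of clusters is fixed at 3 above. One common way to check that choice is to compare silhouette scores across several candidate values. The sketch below is only an illustration under stated assumptions: `toy_spam` is a hypothetical stand-in corpus rather than the real dataset, and the candidate range 2 to 5 is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy messages standing in for the spam subset.
toy_spam = [
    "win a free prize now",
    "free prize claim now",
    "claim your free prize",
    "verify your account details",
    "account verify details now",
    "verify account to avoid suspension",
    "limited time offer act fast",
    "special offer limited time",
    "act fast limited offer",
]

X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_spam).toarray()

# Fit K-means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# The k with the highest silhouette score is a reasonable starting point.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher silhouette score means tighter, better separated clusters. On real data it is worth reading the clusters as well as the score before settling on a value.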

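With the clustering done and the cluster labels added to the data frame, we can analyze the messages in each cluster to identify different types of spam. To understand what each cluster contains, we can extract a few representative keywords per cluster, which helps identify its theme or pattern.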

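Based on the cluster analysis, the spam messages fall into the following types:

Cluster 0

Keywords: 'click', 'details', 'contact', 'website', 'visit'

Description: this cluster is probably about steering the recipient to click a link, possibly to collect personal information or to lead to a malicious website.

Cluster 1

Keywords: 'limited', 'time', 'offer', 'free', 'act'

Description: this cluster is probably about promotional offers, stressing time pressure and free deals.

Cluster 2

Keywords: 'verify', 'account', 'avoid', 'details', 'contact'

Description: this cluster is probably about phishing, attempting to trick the recipient into providing account details or other personal information.

Note that these types come from a preliminary analysis of the text content; a deeper analysis may need to take more context into account.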

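Part 2: Non-Spam Pattern Analysis

To analyze non-spam patterns, we first select the non-spam messages (those labeled 0). We then analyze their content to identify common formats such as meeting reminders and project updates. Finally, we summarize the results and consider how to present them visually.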

non_spam_messages = data[data['is_spam'] == 0]['message_content']
non_spam_messages.head()

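With the non-spam messages selected, we analyze their content to identify common formats such as meeting reminders and project updates. Since this is a text analysis task, natural language processing (NLP) techniques can help identify patterns and topics. A simple approach is to look at the words and phrases that occur most often, which gives a sense of the general content and format of the messages.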

from collections import Counter
import re

words = []
for message in non_spam_messages:
    words.extend(re.findall(r'\b\w+\b', message.lower()))

word_counts = Counter(words)

common_words = word_counts.most_common(20)

common_words

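The most common words extracted from the non-spam messages are:

'you', 'to', 'please', 'the', 'any', 'have', 'questions', 'if', 'i', 'out', 'your', 'reach', 'feel', 'free', 'attached', 'let', 'on', 'know', 'me', 'project'

These words give a sense of the general content and format of non-spam messages. For example, words such as 'questions', 'attached', and 'project' may suggest that a message concerns a project update or shared information, while 'please', 'feel free', and 'reach out' may indicate that the message requests or offers help. Next, we visualize these common words to better understand the patterns in non-spam messages.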

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

words, counts = zip(*common_words)
plt.figure(figsize=(12, 8))
plt.barh(words, counts, color='skyblue')
plt.xlabel('Frequency')
plt.ylabel('Words')
plt.title('Most Common Words in Non-Spam Messages')
plt.gca().invert_yaxis()
plt.gca().xaxis.set_major_locator(ticker.MaxNLocator(integer=True)) 
plt.show()

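The chart above shows the most common words in non-spam messages and their frequencies. Words such as 'you', 'to', and 'please' are very common, which may reflect the polite, conversational nature of these messages. Because no stop words were removed in this count, function words take up much of the list.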
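
Since the raw counts are dominated by function words, filtering stop words before counting can bring content words forward. The sketch below assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` list is an acceptable filter, and uses a hypothetical `toy_ham` list in place of the real non-spam messages.

```python
import re
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical non-spam messages for illustration only.
toy_ham = [
    "Please find the project report attached.",
    "Let me know if you have any questions about the project.",
    "Feel free to reach out if you have questions.",
]

# Same tokenization as the article: lowercase, then whole-word matches.
tokens = []
for message in toy_ham:
    tokens.extend(re.findall(r"\b\w+\b", message.lower()))

# Count only the tokens that are not stop words.
content_counts = Counter(t for t in tokens if t not in ENGLISH_STOP_WORDS)
print(content_counts.most_common(5))
```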

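Next, to identify the vocabulary tied to specific topics (such as meeting reminders and project updates), we first define some keywords or phrases for each topic. We then search the non-spam messages for those keywords, and finally analyze the matching messages to find the words and phrases common to each topic. For meeting reminders, candidate keywords include 'meeting', 'reminder', and 'schedule'. For project updates, they include 'project', 'update', and 'status'.

Below, we search the non-spam messages for these keywords and analyze the content of the matches.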

meeting_keywords = ['meeting', 'reminder', 'schedule', 'agenda', 'time', 'date']
project_keywords = ['project', 'update', 'status', 'progress', 'report']

meeting_messages = non_spam_messages[non_spam_messages.str.contains('|'.join(meeting_keywords), case=False, na=False)]

project_messages = non_spam_messages[non_spam_messages.str.contains('|'.join(project_keywords), case=False, na=False)]

num_meeting_messages = len(meeting_messages)
num_project_messages = len(project_messages)

num_meeting_messages, num_project_messages

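We found 277 messages related to meeting reminders and 283 related to project updates. Note that str.contains matches substrings, so generic keywords such as 'time' and 'date' (which also matches inside 'update') can inflate the meeting count, and a message may fall into both groups.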
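
To see how much substring matching matters, the keyword pattern can be wrapped in word boundaries. This is a small sketch on three hypothetical messages; `word_pattern` is an illustrative name, and the keyword list is shortened.

```python
import re

import pandas as pd

# Hypothetical messages for illustration only.
messages = pd.Series([
    "Here is the project update for this week.",
    "Reminder: the meeting is on Friday.",
    "Please confirm the date of the review.",
])

meeting_keywords = ["meeting", "reminder", "schedule", "agenda", "date"]

# Substring pattern, as in the article's code: "date" also matches "update".
substring_pattern = "|".join(meeting_keywords)

# Whole-word pattern: \b anchors every keyword at word boundaries.
word_pattern = r"\b(?:" + "|".join(map(re.escape, meeting_keywords)) + r")\b"

substring_hits = messages.str.contains(substring_pattern, case=False, na=False)
word_hits = messages.str.contains(word_pattern, case=False, na=False)

print(substring_hits.tolist())  # [True, True, True]
print(word_hits.tolist())       # [False, True, True]
```

The first message is a project update with no meeting content. It is caught by the substring pattern only because "update" contains "date".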

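Next, we analyze the two groups separately to find the words and phrases common to each topic. We tokenize the messages in each group and count how often each word occurs, which shows the typical content and format of meeting reminders and project updates.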

meeting_words = []
for message in meeting_messages:
    meeting_words.extend(re.findall(r'\b\w+\b', message.lower()))

meeting_word_counts = Counter(meeting_words)

project_words = []
for message in project_messages:
    project_words.extend(re.findall(r'\b\w+\b', message.lower()))

project_word_counts = Counter(project_words)

common_meeting_words = meeting_word_counts.most_common(20)
common_project_words = project_word_counts.most_common(20)

common_meeting_words, common_project_words

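For meeting reminder messages, the most common words are:

'you', 'to', 'please', 'the', 'any', 'if', 'have', 'questions', 'i', 'out', 'reach', 'feel', 'free', 'your', 'on', 'let', 'know', 'attached', 'project', 'me'

For project update messages, the most common words are:

'you', 'the', 'to', 'please', 'any', 'if', 'have', 'questions', 'i', 'out', 'your', 'feel', 'free', 'reach', 'project', 'on', 'attached', 'let', 'know', 'me'

The two lists contain the same 20 words in slightly different order, led by shared words such as 'you', 'to', 'please', 'any', 'if', 'have', and 'questions'. These words may reflect the polite, conversational nature of the messages. Of the topic keywords we searched for, only 'project' appears in either top-20 list; 'meeting', 'reminder', 'schedule', 'update', and 'status' do not. We can visualize these common words to better understand the patterns in meeting reminder and project update messages.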

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

words, counts = zip(*common_meeting_words)
ax1.barh(words, counts, color='lightgreen')
ax1.set_xlabel('Frequency')
ax1.set_ylabel('Words')
ax1.set_title('Most Common Words in Meeting Reminder Messages')
ax1.invert_yaxis() 
ax1.xaxis.set_major_locator(ticker.MaxNLocator(integer=True)) 

words, counts = zip(*common_project_words)
ax2.barh(words, counts, color='lightblue')
ax2.set_xlabel('Frequency')
ax2.set_ylabel('Words')
ax2.set_title('Most Common Words in Project Update Messages')
ax2.invert_yaxis() 
ax2.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))  
plt.tight_layout()
plt.show()

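Part 3: Model Training and Evaluation

We start with some basic data exploration: dataset size, missing values, and the distribution of spam versus non-spam. Because spam filtering is a text classification problem, the message content then needs text processing such as stop-word removal and stemming. Next, features are extracted from the processed text; common methods include bag-of-words and TF-IDF. We then choose a suitable machine learning model to train, such as naive Bayes or logistic regression. Finally, we evaluate the model on test data with metrics such as accuracy and recall.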

dataset_size = len(data)
spam_distribution = data['is_spam'].value_counts()
dataset_size, spam_distribution
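
The code above covers dataset size and class balance. The missing-value check mentioned in the text could be sketched as follows. The `toy` frame is a hypothetical stand-in that reuses the dataset's column names; in practice the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical frame with the same columns as the dataset.
toy = pd.DataFrame({
    "message_content": ["Win a free prize", None, "Meeting at 3pm", "Claim your offer"],
    "is_spam": [1, 0, 0, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Drop rows without message text, then recompute the class proportions.
clean = toy.dropna(subset=["message_content"])
balance = clean["is_spam"].value_counts(normalize=True)

print(missing_per_column)
print(balance)
```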

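The dataset has 1,000 records, split evenly between spam (labeled 1) and non-spam (labeled 0), with 500 each. The next step is text processing, using a simple English stop-word list and a hand-written tokenization and stemming routine: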

predefined_stopwords = set([
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves",
    "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their",
    "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was",
    "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and",
    "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between",
    "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off",
    "over", "under", "again", "further", "then", "once"
])

def simple_stemmer(word):
    suffixes = ('ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment')
    stem = word
    for suffix in suffixes:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            break
    return stem

def preprocess_text_v2(text):
    words = text.lower().split()
    processed_words = [simple_stemmer(word) for word in words if word.isalnum() and word not in predefined_stopwords]
    return ' '.join(processed_words)

data['processed_text_v2'] = data['message_content'].apply(preprocess_text_v2)

data[['message_content', 'processed_text_v2']].head()
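
One property of `preprocess_text_v2` is worth knowing: it splits on whitespace and keeps only tokens that pass `isalnum()`, so any word with punctuation attached is dropped entirely. The sketch below contrasts that with a regex tokenizer on a hypothetical sentence; the regex variant is an alternative shown for comparison, not the routine used in this article.

```python
import re

# Hypothetical message for illustration only.
text = "Act now! Limited offer, click here."

# Whitespace split + isalnum(), as in preprocess_text_v2: tokens that carry
# punctuation ("now!", "offer,", "here.") fail isalnum() and are dropped.
split_tokens = [w for w in text.lower().split() if w.isalnum()]

# Regex tokenization keeps the word and discards only the punctuation.
regex_tokens = re.findall(r"[a-z0-9]+", text.lower())

print(split_tokens)  # ['act', 'limited', 'click']
print(regex_tokens)  # ['act', 'now', 'limited', 'offer', 'click', 'here']
```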

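The text has now been processed. Next, we extract features from it using TF-IDF (term frequency-inverse document frequency), which converts text into numeric features that reflect how important each word is: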

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=500, stop_words='english')

X = vectorizer.fit_transform(data['processed_text_v2'])

X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
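
To make the TF-IDF weights less of a black box, the sketch below recomputes them by hand and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 row normalization). The three documents in `docs` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus for illustration only.
docs = ["free prize free", "free meeting", "meeting agenda"]

# Term frequency: raw counts per document.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF, then L2-normalize each row.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

# Result from the vectorizer with default settings, for comparison.
sk = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```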

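With feature extraction complete, we use these features to train a machine learning model. Since this is a text classification problem, we use a naive Bayes classifier, a common text classification algorithm that tends to work well on binary problems such as spam filtering. Model training follows: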

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, data['is_spam'], test_size=0.2, random_state=42)

nb_classifier = MultinomialNB()

nb_classifier.fit(X_train, y_train)

y_pred = nb_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

report = classification_report(y_test, y_pred, target_names=['Non-Spam', 'Spam'])

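# Print rather than display a tuple, so the report's line breaks render properly
print(f"Accuracy: {accuracy:.2%}")
print(report)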

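Model training and testing are complete. The naive Bayes classifier reaches 100% accuracy, meaning it separates spam from non-spam perfectly on the 200-message test set. A perfect score on a small held-out set suggests that this dataset is easy to separate, so it should not be read as a guarantee of similar performance on real-world mail.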
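
The vectorizer above is fitted on the full dataset before the train/test split, so the IDF weights are learned partly from the test messages. A common way to avoid this, and to get a steadier estimate than a single split, is to wrap the vectorizer and classifier in a pipeline and cross-validate it. The sketch below uses hypothetical toy data (`texts`, `labels`) and a 5-fold setup chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now",
    "claim your free offer today",
    "verify your account to avoid suspension",
    "limited time offer act fast",
    "click the website to claim details",
    "please find the project report attached",
    "let me know if you have any questions",
    "meeting reminder for the project update",
    "feel free to reach out about the schedule",
    "attached is the status report for review",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# The pipeline refits the vectorizer on each training fold only,
# so no information from the held-out fold reaches the IDF weights.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
cv_scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")

print(cv_scores, cv_scores.mean())
```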
