Filtering Spam Email with Naive Bayes

news2024/11/25 14:58:28

One of the best-known applications of naive Bayes is spam email filtering.

Workflow

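  1. Collect the data: text files are provided.
  2. Prepare the data: parse the text files into token vectors.
  3. Analyze the data: inspect the tokens to make sure parsing is correct.
  4. Train the algorithm: use the trainNB0() function.
  5. Test the algorithm: use classifyNB(), and write a new test function that computes the error rate over the document set.
  6. Use the algorithm: build a complete program that classifies a group of documents and prints the misclassified ones to the screen.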

Preparing the data

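Use a regular expression to split the data: each email's content is broken into a list of words, which are converted to lowercase. Tokens of two characters or fewer are dropped.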

def textParse(bigString):    # input is a big string, output is a word list
    import re
    # Split on runs of non-word characters (r'\\W*' would only match a literal backslash)
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
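
As a quick check of the tokenization step, the sketch below runs the same split-and-lowercase logic on a made-up sentence. The helper name `tokenize` and the sample text are illustrative assumptions, not part of the original code. It also shows why the exact pattern matters: `r'\W+'` splits on runs of non-word characters, while a pattern written with a doubled backslash does not split ordinary text at all.

```python
import re

def tokenize(text):
    """Lowercase word tokens longer than two characters (hypothetical helper)."""
    return [tok.lower() for tok in re.split(r'\W+', text) if len(tok) > 2]

sample = "Hello, World! This is a test of Python."
print(tokenize(sample))
# ['hello', 'world', 'this', 'test', 'python']

# A doubled backslash changes the meaning: r'\\W*' needs a literal backslash
# to match, so text without one comes back as a single unsplit string.
print(re.split(r'\\W*', sample) == [sample])
# True
```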

Building the training and test sets

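The emails in both sets are chosen at random. This example has 50 emails (25 spam and 25 ham), 10 of which are randomly selected as the test set.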

The probabilities used by the classifier are computed from the documents in the training set only.

    docList=[]; classList = []; fullText =[]
    # Read each email and turn it into a list of tokens
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
   
    # Test set: move 10 random indices out of trainingSet
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
   
    # Training set: build word vectors for the remaining 40 documents
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])

The document corresponding to each selected index is added to the test set and removed from the training set.

Randomly choosing part of the data as the training set and using the remainder as the test set is called hold-out cross validation.
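
The index shuffling described above can also be written more compactly. The following is a minimal sketch, assuming only the standard library; the function name `hold_out_split` and the `seed` parameter are hypothetical and are introduced here so that a split can be reproduced.

```python
import random

def hold_out_split(n_docs, n_test, seed=None):
    """Return (train_idx, test_idx): n_test indices are held out at random."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no index is picked twice
    test_idx = rng.sample(range(n_docs), n_test)
    held_out = set(test_idx)
    train_idx = [i for i in range(n_docs) if i not in held_out]
    return train_idx, test_idx

train_idx, test_idx = hold_out_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
# 40 10
```

Passing a fixed seed makes a particular split repeatable, which helps when comparing runs; leaving it as None gives a fresh random split each time, as in the original loop.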

Classifying the emails

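Iterate over every document in the training set and build a word vector for each email from the vocabulary. The code here uses bagOfWords2VecMN(), the bag-of-words counterpart of setOfWords2Vec().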

These word vectors are used by trainNB0() to compute the probabilities needed for classification.

Then iterate over the test set and classify each email.

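If an email is misclassified, the error count is incremented by 1; at the end the overall error rate is reported as a fraction of the test set.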
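
The helper functions createVocabList(), bagOfWords2VecMN(), trainNB0() and classifyNB() are called in the code but not shown in this article. The sketch below is one plausible way to write them, not the author's implementation. It assumes NumPy, add-one (Laplace) smoothing, and log probabilities to avoid underflow; the toy documents at the end are made up for illustration.

```python
import numpy as np

def createVocabList(dataSet):
    """Sorted list of the unique words across all documents."""
    vocabSet = set()
    for document in dataSet:
        vocabSet |= set(document)
    return sorted(vocabSet)

def bagOfWords2VecMN(vocabList, inputSet):
    """Count how often each vocabulary word occurs in the document."""
    index = {word: i for i, word in enumerate(vocabList)}
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in index:
            returnVec[index[word]] += 1
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    """Return log P(word | class 0), log P(word | class 1), P(class 1)."""
    trainMatrix = np.asarray(trainMatrix, dtype=float)
    trainCategory = np.asarray(trainCategory)
    pClass1 = trainCategory.mean()
    # Assumed smoothing: add 1 to every word count so no probability is zero
    p1Num = trainMatrix[trainCategory == 1].sum(axis=0) + 1.0
    p0Num = trainMatrix[trainCategory == 0].sum(axis=0) + 1.0
    return np.log(p0Num / p0Num.sum()), np.log(p1Num / p1Num.sum()), pClass1

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Pick the class with the larger log posterior (up to a constant)."""
    p1 = np.dot(vec2Classify, p1Vec) + np.log(pClass1)
    p0 = np.dot(vec2Classify, p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

# Toy data (made up): two spam-like and two ham-like documents
docs = [['buy', 'cheap', 'pills'], ['cheap', 'pills', 'now'],
        ['meeting', 'tomorrow', 'lunch'], ['lunch', 'with', 'peter']]
labels = [1, 1, 0, 0]
vocab = createVocabList(docs)
trainMat = [bagOfWords2VecMN(vocab, d) for d in docs]
p0V, p1V, pSpam = trainNB0(trainMat, labels)
print(classifyNB(np.array(bagOfWords2VecMN(vocab, ['cheap', 'pills'])), p0V, p1V, pSpam))
# 1
print(classifyNB(np.array(bagOfWords2VecMN(vocab, ['lunch', 'peter'])), p0V, p1V, pSpam))
# 0
```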

Complete test code:

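import random
import re
from numpy import array
# createVocabList, bagOfWords2VecMN, trainNB0 and classifyNB are assumed to be defined elsewhere

def textParse(bigString):    # input is a big string, output is a word list
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]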
    
def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    err_rate = float(errorCount)/len(testSet)
    print ('the error rate is: ', err_rate)
    #return vocabList,fullText
    return err_rate

To estimate the classifier's error rate more accurately, run the test several times and average the error rates.

    print("******* File parsing and the complete spam test function spamTest ******* ")
    error_rate = 0.0
    for i in range(10):
        error_rate += spamTest()
    print("average error rate: {}".format(error_rate/10.0))
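
A variation on the loop above reports the spread of the per-run error rates as well as their mean, which gives a sense of how much a single hold-out split can vary. This is a minimal sketch using the standard library; `summarize_error_rates` is a hypothetical helper, and `run_once` stands for any zero-argument function that returns an error rate, such as spamTest.

```python
import statistics

def summarize_error_rates(run_once, n_runs=10):
    """Call run_once() n_runs times; return (mean, sample standard deviation)."""
    rates = [run_once() for _ in range(n_runs)]
    spread = statistics.stdev(rates) if n_runs > 1 else 0.0
    return statistics.mean(rates), spread

# Usage with the test function from this article (not defined in this block):
#   mean_rate, spread = summarize_error_rates(spamTest)
#   print("average error rate: {} (std dev {})".format(mean_rate, spread))
```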

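Result of the ten spamTest() runs. The start of the log appears to be cut off, so only nine of the ten per-run error rates are visible. Each misclassified document is printed as a one-element list holding the entire lowercased email, which suggests that the emails in this run were never split into words; that likely accounts for the high average error rate of 0.45.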

classification error ['there was a guy at the gas station who told me that if i knew mandarin\nand python i could get a job with the fbi.']
the error rate is:  0.6
classification error ['ryan whybrew commented on your status.\n\nryan wrote:\n"turd ferguson or butt horn."\n']
classification error ["we saw this on the way to the coast...thought u might like it\n\nhangzhou is huge, one day wasn't enough, but we got a glimpse...\n\nwe went inside the china pavilion at expo, it is pretty interesting,\neach province has an exhibit..."]
classification error ["i've thought about this and think it's possible. we should get another\nlunch. i have a car now and could come pick you up this time. does\nthis wednesday work? 11:50?\n\ncan i have a signed copy of you book?"]
classification error ["\nscifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new nvidia fermi-class tesla 20-series gpu.\n\nscifinance® is a derivatives pricing and risk model development tool that automatically generates c/c++ and gpu-enabled source code from concise, high-level model specifications. no parallel computing or cuda programming expertise is required.\n\nscifinance's automatic, gpu-enabled monte carlo pricing model source code generation capabilities have been significantly extended in the latest release. this includes:\n\n"]
classification error ['hi peter,\n\nwith jose out of town, do you want to\nmeet once in a while to keep things\ngoing and do some interesting stuff?\n\nlet me know\neugene']
classification error ["yo.  i've been working on my running website.  i'm using jquery and the jqplot plugin.  i'm not too far away from having a prototype to launch.  \n\nyou used jqplot right?  if not, i think you would like it."]
classification error ['ok i will be there by 10:00 at the latest.']
the error rate is:  0.7
the error rate is:  0.0
classification error ['ordercializviagra online & save 75-90%\n\n0nline pharmacy noprescription required\nbuy canadian drugs at wholesale prices and save 75-90%\nfda-approved drugs + superb quality drugs only!\naccept all major credit cards']
classification error ['buyviagra 25mg, 50mg, 100mg,\nbrandviagra, femaleviagra from $1.15 per pill\n\n\nviagranoprescription needed - from certified canadian pharmacy\n\nbuy here... we accept visa, amex, e-check... worldwide delivery']
the error rate is:  0.2
classification error ['benoit mandelbrot 1924-2010\n\nbenoit mandelbrot 1924-2010\n\nwilmott team\n\nbenoit mandelbrot, the mathematician, the father of fractal mathematics, and advocate of more sophisticated modelling in quantitative finance, died on 14th october 2010 aged 85.\n\nwilmott magazine has often featured mandelbrot, his ideas, and the work of others inspired by his fundamental insights.\n\nyou must be logged on to view these articles from past issues of wilmott magazine.']
classification error ['that is cold.  is there going to be a retirement party?  \nare the leaves changing color?']
classification error ['zach hamm commented on your status.\n\nzach wrote:\n"doggy style - enough said, thank you & good night"\n\n\n']
classification error ["hi peter,\n\nthese are the only good scenic ones and it's too bad there was a girl's back in one of them. just try to enjoy the blue sky : ))\n\nd"]
classification error ["hi hommies,\n\njust got a phone call from the roofer, they will come and spaying the foaming today. it will be dusty. pls close all the doors and windows.\ncould you help me to close my bathroom window, cat window and the sliding door behind the tv?\ni don't know how can those 2 cats survive......\n\nsorry for any inconvenience!"]
classification error ['there was a guy at the gas station who told me that if i knew mandarin\nand python i could get a job with the fbi.']
the error rate is:  0.6
classification error ['ordercializviagra online & save 75-90%\n\n0nline pharmacy noprescription required\nbuy canadian drugs at wholesale prices and save 75-90%\nfda-approved drugs + superb quality drugs only!\naccept all major credit cards']
classification error ['experience with biggerpenis today! grow 3-inches more\n\nthe safest & most effective methods of_penisen1argement.\nsave your time and money!\nbettererections with effective ma1eenhancement products.\n\n#1 ma1eenhancement supplement. trusted by millions. buy today!']
the error rate is:  0.2
classification error ['a home based business opportunity is knocking at your door.\n\ndon\x92t be rude and let this chance go by.\n\nyou can earn a great income and find\nyour financial life transformed.\n\nlearn more here.\n\n\n\nto your success.\n\nwork from home finder experts\n']
classification error ['ordercializviagra online & save 75-90%\n\n0nline pharmacy noprescription required\nbuy canadian drugs at wholesale prices and save 75-90%\nfda-approved drugs + superb quality drugs only!\naccept all major credit cards']
classification error ['codeine (the most competitive price on net!)\n\ncodeine (wilson) 30mg x 30 $156.00\ncodeine (wilson) 30mg x 60 $291.00 (+4 freeviagra pills)\ncodeine (wilson) 30mg x 90 $396.00 (+4 freeviagra pills)\ncodeine (wilson) 30mg x 120 $492.00 (+10 freeviagra pills)']
the error rate is:  0.3
classification error ["yo.  i've been working on my running website.  i'm using jquery and the jqplot plugin.  i'm not too far away from having a prototype to launch.  \n\nyou used jqplot right?  if not, i think you would like it."]
classification error ['hi peter,\n\n    sure thing.  sounds good.  let me know what time would be good for you.\ni will come prepared with some ideas and we can go from there.\n\nregards,\n\n-vivek.']
classification error ["\nscifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new nvidia fermi-class tesla 20-series gpu.\n\nscifinance® is a derivatives pricing and risk model development tool that automatically generates c/c++ and gpu-enabled source code from concise, high-level model specifications. no parallel computing or cuda programming expertise is required.\n\nscifinance's automatic, gpu-enabled monte carlo pricing model source code generation capabilities have been significantly extended in the latest release. this includes:\n\n"]
classification error ['linkedin\n\njulius o requested to add you as a connection on linkedin:\n\nhi peter.\n\nlooking forward to the book!\n\n \naccept \tview invitation from julius o\n']
classification error ["thanks peter.\n\ni'll definitely check in on this. how is your book\ngoing? i heard chapter 1 came in and it was in \ngood shape. ;-)\n\ni hope you are doing well.\n\ncheers,\n\ntroy"]
classification error ['jay stepp commented on your status.\n\njay wrote:\n""to the" ???"\n\n\nreply to this email to comment on this status.\n\nto see the comment thread, follow the link below:\n\n']
classification error ['this e-mail was sent from a notification-only address that cannot accept incoming e-mail. please do not reply to this message.\n\nthank you for your online reservation. the store you selected has located the item you requested and has placed it on hold in your name. please note that all items are held for 1 day.  please note store prices may differ from those online.\n\nif you have questions or need assistance with your reservation, please contact the store at the phone number listed below. you can also access store information, such as store hours and location, on the web at http://www.borders.com/online/store/storedetailview_98.']
the error rate is:  0.7
classification error ['zach hamm commented on your status.\n\nzach wrote:\n"doggy style - enough said, thank you & good night"\n\n\n']
classification error ['hi peter,\n \nthe hotels are the ones that rent out the tent. they are all lined up on the hotel grounds : )) so much for being one with nature, more like being one with a couple dozen tour groups and nature.\ni have about 100m of pictures from that trip. i can go through them and get you jpgs of my favorite scenic pictures.\n \nwhere are you and jocelyn now? new york? will you come to tokyo for chinese new year? perhaps to see the two of you then. i will go to thailand for winter holiday to see my mom : )\n \ntake care,\nd\n']
classification error ["what is going on there?\ni talked to john on email.  we talked about some computer stuff that's it.\n\ni went bike riding in the rain, it was not that cold.\n\nwe went to the museum in sf yesterday it was $3 to get in and they had\nfree food.  at the same time was a sf giants game, when we got done we\nhad to take the train with all the giants fans, they are 1/2 drunk."]
classification error ['that is cold.  is there going to be a retirement party?  \nare the leaves changing color?']
classification error ['there was a guy at the gas station who told me that if i knew mandarin\nand python i could get a job with the fbi.']
classification error ["linkedin\n\nkerry haloney requested to add you as a connection on linkedin:\n\npeter,\n\ni'd like to add you to my professional network on linkedin.\n\n- kerry haloney\n \n"]
the error rate is:  0.6
average error rate: 0.45
