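Filtering Spam with Naive Bayes
Contents
- Filtering Spam with Naive Bayes
- Workflow
- Preparing the Data
- Building the Training and Test Sets
- Classifying Emails
- Complete Test Code: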
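The best-known application of naive Bayes is filtering spam email.
Workflow
- Collect data: text files are provided
- Prepare data: parse the text files into token vectors
- Analyze data: inspect the tokens to make sure parsing is correct
- Train the algorithm: use the trainNB0() function
- Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over a set of documents
- Use the algorithm: build a complete program that classifies a group of documents and prints the misclassified ones to the screen
Preparing the Data
Split the data with a regular expression: break each email body into a list of words and convert them to lowercase.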
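import re

def textParse(bigString):  # input is a big string, output is a word list
    # Split on runs of non-word characters. Note the single backslash:
    # r'\\W*' would match a literal backslash and leave the text unsplit.
    listOfTokens = re.split(r'\W+', bigString)
    # Keep lowercase tokens longer than two characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]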
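To see why the regular expression matters, the following minimal sketch compares a pattern with a doubled backslash against `\W+`. The sample sentence and the snake_case name `text_parse` are made up for illustration and are not part of the original code.

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2 chars."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

# Hypothetical sample text, not taken from the email data set
sample = "Hi Peter, BUY cheap drugs NOW at 75-90% off!"

# Doubled backslash in a raw string: the pattern is a literal '\' followed by
# zero or more 'W'. The sample contains no backslash, so nothing is split.
unsplit = re.split(r'\\W*', sample)

tokens = text_parse(sample)

print(unsplit)  # ['Hi Peter, BUY cheap drugs NOW at 75-90% off!']
print(tokens)   # ['peter', 'buy', 'cheap', 'drugs', 'now', 'off']
```

With the doubled backslash, each email would end up as a single "word", which leaves the classifier nothing useful to count.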
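Building the Training and Test Sets
The emails in both sets are chosen at random. This example has 50 emails, 10 of which are randomly selected as the test set.
The probabilities used by the classifier are computed only from the documents in the training set.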
import random

docList = []; classList = []; fullText = []
# Read each email and turn it into a token list (label 1 = spam, 0 = ham)
for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
vocabList = createVocabList(docList)  # create vocabulary

# Test set: draw 10 of the 50 document indices at random
trainingSet = list(range(50)); testSet = []
for i in range(10):
    randIndex = random.randrange(len(trainingSet))
    testSet.append(trainingSet[randIndex])
    del trainingSet[randIndex]

# Training set: build word vectors for the remaining 40 documents
trainMat = []; trainClasses = []
for docIndex in trainingSet:
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
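The documents whose indices are drawn are added to the test set and removed from the training set.
Randomly selecting part of the data as the training set and using the remainder as the test set is called hold-out cross validation.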
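The hold-out split described above can also be written with `random.sample`. This is a minimal sketch, not the author's code: the helper name `holdout_split` and the `seed` parameter are assumptions added for illustration and reproducibility.

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Return (train_indices, test_indices) for a random hold-out split."""
    rng = random.Random(seed)
    # Draw n_test distinct document indices for the test set
    test = rng.sample(range(n_docs), n_test)
    held_out = set(test)
    # Every index that was not drawn stays in the training set
    train = [i for i in range(n_docs) if i not in held_out]
    return train, test

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
```

Sampling without replacement guarantees that the two index sets are disjoint, so no test document leaks into training.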
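Classifying Emails
Iterate over every document in the training set and build a word vector for each email from the vocabulary with bagOfWords2VecMN().
These vectors are used in trainNB0() to compute the probabilities needed for classification.
Then iterate over the test set and classify each email.
Whenever an email is misclassified, increment the error count by 1; at the end, report the overall error rate.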
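The helper functions used in this step are not shown in this post. The sketch below is a self-contained stand-in under stated assumptions: the names `create_vocab_list`, `bag_of_words`, `train_nb` and `classify_nb` are hypothetical, the model is a multinomial naive Bayes with add-one smoothing and log probabilities, and the four toy documents are invented. It may differ in detail from the author's createVocabList(), bagOfWords2VecMN(), trainNB0() and classifyNB().

```python
import numpy as np

def create_vocab_list(docs):
    """Sorted list of every distinct token across all documents."""
    vocab = set()
    for doc in docs:
        vocab.update(doc)
    return sorted(vocab)

def bag_of_words(vocab, doc):
    """Count how often each vocabulary word occurs in doc."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = np.zeros(len(vocab))
    for w in doc:
        if w in index:          # words outside the vocabulary are ignored
            vec[index[w]] += 1
    return vec

def train_nb(train_mat, labels):
    """Return log P(word|ham), log P(word|spam) and P(spam), with add-one smoothing."""
    train_mat = np.asarray(train_mat, dtype=float)
    labels = np.asarray(labels)
    p_spam = labels.mean()
    counts1 = train_mat[labels == 1].sum(axis=0) + 1.0
    counts0 = train_mat[labels == 0].sum(axis=0) + 1.0
    return np.log(counts0 / counts0.sum()), np.log(counts1 / counts1.sum()), p_spam

def classify_nb(vec, p0_log, p1_log, p_spam):
    """Pick the class with the larger log posterior (1 = spam, 0 = ham)."""
    score1 = vec @ p1_log + np.log(p_spam)
    score0 = vec @ p0_log + np.log(1.0 - p_spam)
    return 1 if score1 > score0 else 0

# Toy training data (invented for illustration)
docs = [['buy', 'cheap', 'pills'], ['cheap', 'pills', 'online'],
        ['lunch', 'meeting', 'tomorrow'], ['meeting', 'notes', 'attached']]
labels = [1, 1, 0, 0]

vocab = create_vocab_list(docs)
p0, p1, p_spam = train_nb([bag_of_words(vocab, d) for d in docs], labels)

# Classify held-out documents and count the errors
test_docs = [(['cheap', 'pills', 'today'], 1), (['lunch', 'meeting'], 0)]
errors = sum(
    classify_nb(bag_of_words(vocab, doc), p0, p1, p_spam) != label
    for doc, label in test_docs
)
error_rate = errors / len(test_docs)
print("error rate:", error_rate)  # error rate: 0.0
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.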
Complete Test Code:
import re
import random
from numpy import array

# createVocabList, bagOfWords2VecMN, trainNB0 and classifyNB are assumed to be
# defined elsewhere; they are not shown in this post.

def textParse(bigString):  # input is a big string, output is a word list
    # Split on runs of non-word characters (single backslash in the raw string)
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    # Load 25 spam and 25 ham emails (label 1 = spam, 0 = ham)
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # create vocabulary
    # Hold out 10 randomly chosen documents as the test set
    trainingSet = list(range(50)); testSet = []
    for i in range(10):
        randIndex = random.randrange(len(trainingSet))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    # Train the classifier (get probabilities) on the remaining documents
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    # Classify the held-out documents and count the mistakes
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    err_rate = float(errorCount) / len(testSet)
    print('the error rate is: ', err_rate)
    return err_rate
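To estimate the classifier's error rate more accurately, run the test several times and average the error rates.
print("******* File parsing and complete spam test function spamTest *******")
error_rate = 0.0
# Repeat the random hold-out test 10 times and average the error rates
for i in range(10):
    error_rate += spamTest()
print("average error rate: {}".format(error_rate / 10.0))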
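Output:
Each misclassified document below is printed as a list holding one string, the entire lowercased email. That means this run used the pattern r'\\W*' in textParse, which matches a literal backslash rather than non-word characters, so the emails were never split into words. The error rates shown here therefore largely reflect that tokenization problem and should not be read as the classifier's performance on properly tokenized text.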
classification error ['there was a guy at the gas station who told me that if i knew mandarin\nand python i could get a job with the fbi.']
the error rate is: 0.6
classification error ['ryan whybrew commented on your status.\n\nryan wrote:\n"turd ferguson or butt horn."\n']
classification error ["we saw this on the way to the coast...thought u might like it\n\nhangzhou is huge, one day wasn't enough, but we got a glimpse...\n\nwe went inside the china pavilion at expo, it is pretty interesting,\neach province has an exhibit..."]
classification error ["i've thought about this and think it's possible. we should get another\nlunch. i have a car now and could come pick you up this time. does\nthis wednesday work? 11:50?\n\ncan i have a signed copy of you book?"]
classification error ["\nscifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new nvidia fermi-class tesla 20-series gpu.\n\nscifinance® is a derivatives pricing and risk model development tool that automatically generates c/c++ and gpu-enabled source code from concise, high-level model specifications. no parallel computing or cuda programming expertise is required.\n\nscifinance's automatic, gpu-enabled monte carlo pricing model source code generation capabilities have been significantly extended in the latest release. this includes:\n\n"]
classification error ['hi peter,\n\nwith jose out of town, do you want to\nmeet once in a while to keep things\ngoing and do some interesting stuff?\n\nlet me know\neugene']
classification error ["yo. i've been working on my running website. i'm using jquery and the jqplot plugin. i'm not too far away from having a prototype to launch. \n\nyou used jqplot right? if not, i think you would like it."]
classification error ['ok i will be there by 10:00 at the latest.']
the error rate is: 0.7
the error rate is: 0.0
classification error ['ordercializviagra online & save 75-90%\n\n0nline pharmacy noprescription required\nbuy canadian drugs at wholesale prices and save 75-90%\nfda-approved drugs + superb quality drugs only!\naccept all major credit cards']
classification error ['buyviagra 25mg, 50mg, 100mg,\nbrandviagra, femaleviagra from $1.15 per pill\n\n\nviagranoprescription needed - from certified canadian pharmacy\n\nbuy here... we accept visa, amex, e-check... worldwide delivery']
the error rate is: 0.2
classification error ['benoit mandelbrot 1924-2010\n\nbenoit mandelbrot 1924-2010\n\nwilmott team\n\nbenoit mandelbrot, the mathematician, the father of fractal mathematics, and advocate of more sophisticated modelling in quantitative finance, died on 14th october 2010 aged 85.\n\nwilmott magazine has often featured mandelbrot, his ideas, and the work of others inspired by his fundamental insights.\n\nyou must be logged on to view these articles from past issues of wilmott magazine.']
classification error ['that is cold. is there going to be a retirement party? \nare the leaves changing color?']
classification error ['zach hamm commented on your status.\n\nzach wrote:\n"doggy style - enough said, thank you & good night"\n\n\n']
classification error ["hi peter,\n\nthese are the only good scenic ones and it's too bad there was a girl's back in one of them. just try to enjoy the blue sky : ))\n\nd"]
classification error ["hi hommies,\n\njust got a phone call from the roofer, they will come and spaying the foaming today. it will be dusty. pls close all the doors and windows.\ncould you help me to close my bathroom window, cat window and the sliding door behind the tv?\ni don't know how can those 2 cats survive......\n\nsorry for any inconvenience!"]
classification error ['there was a guy at the gas station who told me that if i knew mandarin\nand python i could get a job with the fbi.']
the error rate is: 0.6
classification error ['ordercializviagra online & save 75-90%\n\n0nline pharmacy noprescription required\nbuy canadian drugs at wholesale prices and save 75-90%\nfda-approved drugs + superb quality drugs only!\naccept all major credit cards']
classification error ['experience with biggerpenis today! grow 3-inches more\n\nthe safest & most effective methods of_penisen1argement.\nsave your time and money!\nbettererections with effective ma1eenhancement products.\n\n#1 ma1eenhancement supplement. trusted by millions. buy today!']
the error rate is: 0.2
classification error ['a home based business opportunity is knocking at your door.\n\ndon\x92t be rude and let this chance go by.\n\nyou can earn a great income and find\nyour financial life transformed.\n\nlearn more here.\n\n\n\nto your success.\n\nwork from home finder experts\n']
classification error ['ordercializviagra online & save 75-90%\n\n0nline pharmacy noprescription required\nbuy canadian drugs at wholesale prices and save 75-90%\nfda-approved drugs + superb quality drugs only!\naccept all major credit cards']
classification error ['codeine (the most competitive price on net!)\n\ncodeine (wilson) 30mg x 30 $156.00\ncodeine (wilson) 30mg x 60 $291.00 (+4 freeviagra pills)\ncodeine (wilson) 30mg x 90 $396.00 (+4 freeviagra pills)\ncodeine (wilson) 30mg x 120 $492.00 (+10 freeviagra pills)']
the error rate is: 0.3
classification error ["yo. i've been working on my running website. i'm using jquery and the jqplot plugin. i'm not too far away from having a prototype to launch. \n\nyou used jqplot right? if not, i think you would like it."]
classification error ['hi peter,\n\n sure thing. sounds good. let me know what time would be good for you.\ni will come prepared with some ideas and we can go from there.\n\nregards,\n\n-vivek.']
classification error ["\nscifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new nvidia fermi-class tesla 20-series gpu.\n\nscifinance® is a derivatives pricing and risk model development tool that automatically generates c/c++ and gpu-enabled source code from concise, high-level model specifications. no parallel computing or cuda programming expertise is required.\n\nscifinance's automatic, gpu-enabled monte carlo pricing model source code generation capabilities have been significantly extended in the latest release. this includes:\n\n"]
classification error ['linkedin\n\njulius o requested to add you as a connection on linkedin:\n\nhi peter.\n\nlooking forward to the book!\n\n \naccept \tview invitation from julius o\n']
classification error ["thanks peter.\n\ni'll definitely check in on this. how is your book\ngoing? i heard chapter 1 came in and it was in \ngood shape. ;-)\n\ni hope you are doing well.\n\ncheers,\n\ntroy"]
classification error ['jay stepp commented on your status.\n\njay wrote:\n""to the" ???"\n\n\nreply to this email to comment on this status.\n\nto see the comment thread, follow the link below:\n\n']
classification error ['this e-mail was sent from a notification-only address that cannot accept incoming e-mail. please do not reply to this message.\n\nthank you for your online reservation. the store you selected has located the item you requested and has placed it on hold in your name. please note that all items are held for 1 day. please note store prices may differ from those online.\n\nif you have questions or need assistance with your reservation, please contact the store at the phone number listed below. you can also access store information, such as store hours and location, on the web at http://www.borders.com/online/store/storedetailview_98.']
the error rate is: 0.7
classification error ['zach hamm commented on your status.\n\nzach wrote:\n"doggy style - enough said, thank you & good night"\n\n\n']
classification error ['hi peter,\n \nthe hotels are the ones that rent out the tent. they are all lined up on the hotel grounds : )) so much for being one with nature, more like being one with a couple dozen tour groups and nature.\ni have about 100m of pictures from that trip. i can go through them and get you jpgs of my favorite scenic pictures.\n \nwhere are you and jocelyn now? new york? will you come to tokyo for chinese new year? perhaps to see the two of you then. i will go to thailand for winter holiday to see my mom : )\n \ntake care,\nd\n']
classification error ["what is going on there?\ni talked to john on email. we talked about some computer stuff that's it.\n\ni went bike riding in the rain, it was not that cold.\n\nwe went to the museum in sf yesterday it was $3 to get in and they had\nfree food. at the same time was a sf giants game, when we got done we\nhad to take the train with all the giants fans, they are 1/2 drunk."]
classification error ['that is cold. is there going to be a retirement party? \nare the leaves changing color?']
classification error ['there was a guy at the gas station who told me that if i knew mandarin\nand python i could get a job with the fbi.']
classification error ["linkedin\n\nkerry haloney requested to add you as a connection on linkedin:\n\npeter,\n\ni'd like to add you to my professional network on linkedin.\n\n- kerry haloney\n \n"]
the error rate is: 0.6
average error rate: 0.45
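The per-run error rates printed above range from 0.0 to 0.7, so the average alone hides a lot of run-to-run variation. A minimal sketch of reporting the spread alongside the mean follows; the helper names `summarize_error_rates` and `repeated_holdout` are hypothetical, and the sample rates are illustrative values, not results from a real run.

```python
import statistics

def summarize_error_rates(rates):
    """Return (mean, population standard deviation, min, max) of the error rates."""
    return (statistics.mean(rates), statistics.pstdev(rates), min(rates), max(rates))

def repeated_holdout(run_once, n_runs=10):
    """Call run_once() n_runs times and summarize the error rates it returns."""
    return summarize_error_rates([run_once() for _ in range(n_runs)])

# Illustrative values only, not taken from a real run
rates = [0.0, 0.1, 0.2, 0.1]
mean, spread, lowest, highest = summarize_error_rates(rates)
print("mean={:.3f} std={:.3f} min={} max={}".format(mean, spread, lowest, highest))
```

In the article's setting, `repeated_holdout(spamTest)` would play the role of the averaging loop, assuming spamTest() returns the error rate of one run as it does in the code above.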