The Role of TF-IDF in Modern Search Engine Optimization Strategies

news2026/2/12 15:43:39

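TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method used in text mining and information retrieval to assess how important a term is to a document within a corpus. It combines two measures, term frequency (TF) and inverse document frequency (IDF), so it accounts both for how often a term occurs in a single document and for how common the term is across the whole corpus.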

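1. Terminology

1.1 Term Frequency (TF)

Term frequency is how often a term occurs in a document. If a term occurs often, it must be important, right? Not always! Words such as "and", "the", and "is" occur frequently in English, but they say little about what a document is about. This is where IDF comes in.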

tf(t,d) = count of t in d / number of words in d
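
The ratio above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `term_frequency` and the sample sentence are assumptions made for this example, and the document is tokenized by a plain whitespace split.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Return count(term) / len(tokens) for an already tokenized document."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return Counter(tokens)[term] / len(tokens)


# Hypothetical sample document, split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()
tf_the = term_frequency("the", tokens)  # "the" is 2 of the 6 tokens
tf_dog = term_frequency("dog", tokens)  # a term that never occurs
print(tf_the, tf_dog)
```

Note that "the" receives the highest TF in this sentence even though it says nothing about the topic, which is exactly the weakness that IDF corrects.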

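1.2 Inverse Document Frequency (IDF)

Inverse document frequency measures how common a term is across the whole corpus. The more documents a term appears in, the less information it carries; the fewer documents it appears in, the more information it carries.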

idf(t,D) = log(number of documents in D / number of documents in D that contain t)
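
As a rough sketch, IDF can be computed as the logarithm of the corpus size divided by the number of documents that contain the term. The function name `inverse_document_frequency`, the choice of the natural logarithm, and the tiny corpus are assumptions made for illustration; libraries often use smoothed variants instead.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(N / df), where df is the number of documents containing term.

    `documents` is a list of token lists. The natural logarithm is used here;
    this is one common convention, not the only one.
    """
    doc_freq = sum(1 for tokens in documents if term in tokens)
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(documents) / doc_freq)


# Hypothetical three-document corpus, already tokenized.
documents = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
idf_the = inverse_document_frequency("the", documents)  # in every document
idf_cat = inverse_document_frequency("cat", documents)  # in one document
print(idf_the, idf_cat)
```

A term that occurs in every document gets an IDF of zero, so it contributes nothing to the final score no matter how often it occurs.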

2. Computing TF-IDF

The TF-IDF value is the product of TF and IDF and measures how important a term is to a document. The formula is:

TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)
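
Written out in full, the product looks as follows. This assumes the common unsmoothed IDF, and the symbols $f_{t,d}$, $|d|$, $N$, and $n_t$ are notation introduced here for illustration: the count of $t$ in $d$, the number of words in $d$, the number of documents in $D$, and the number of documents containing $t$.

```latex
\mathrm{TF\text{-}IDF}(t,d,D)
  = \mathrm{TF}(t,d) \times \mathrm{IDF}(t,D)
  = \frac{f_{t,d}}{|d|} \times \log\frac{N}{n_t}
```

The score is high only when a term is frequent in the document ($f_{t,d}$ large) and rare in the corpus ($n_t$ small).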

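3. Example

Suppose we have the following three documents:

Document 1: 我喜欢看电影 (I like watching movies)
Document 2: 我不喜欢看电影 (I don't like watching movies)
Document 3: 我喜欢看书 (I like reading books)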

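First, we compute the term frequency (TF) of each term in each document. Documents 1 and 3 contain four terms each, while Document 2 contains five (我 / 不 / 喜欢 / 看 / 电影), so its denominator is 5:

Term	Document 1 TF	Document 2 TF	Document 3 TF
我 (I)	1/4	1/5	1/4
喜欢 (like)	1/4	1/5	1/4
看 (watch/read)	1/4	1/5	1/4
电影 (movie)	1/4	1/5	0
不 (not)	0	1/5	0
书 (book)	0	0	1/4

Next, we compute the inverse document frequency (IDF) of each term, where log denotes the natural logarithm:

Term	Documents containing the term	IDF
我	3	$\log(3/3) = 0$
喜欢	3	$\log(3/3) = 0$
看	3	$\log(3/3) = 0$
电影	2	$\log(3/2) \approx 0.405$
不	1	$\log(3/1) \approx 1.099$
书	1	$\log(3/1) \approx 1.099$

Finally, we compute the TF-IDF values:

Term	Document 1 TF-IDF	Document 2 TF-IDF	Document 3 TF-IDF
我	0	0	0
喜欢	0	0	0
看	0	0	0
电影	$1/4 \times 0.405 \approx 0.101$	$1/5 \times 0.405 \approx 0.081$	0
不	0	$1/5 \times 1.099 \approx 0.220$	0
书	0	0	$1/4 \times 1.099 \approx 0.275$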
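
The calculation for the three example documents can be reproduced with a short script. This is a sketch under stated assumptions: the documents are segmented by hand into token lists (a real pipeline would need a Chinese word segmenter), the natural logarithm is used, and the helper names `tf` and `idf` are chosen for this example.

```python
import math

# Hand-segmented tokens for the three example documents (an assumption;
# automatic word segmentation is outside the scope of this sketch).
docs = {
    "doc1": ["我", "喜欢", "看", "电影"],
    "doc2": ["我", "不", "喜欢", "看", "电影"],
    "doc3": ["我", "喜欢", "看", "书"],
}
corpus = list(docs.values())


def tf(term, tokens):
    """Count of term in the document divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Natural log of (number of documents / documents containing term)."""
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / doc_freq)


# Every vocabulary term occurs in at least one document, so doc_freq > 0.
vocab = sorted({term for tokens in corpus for term in tokens})
scores = {
    name: {term: tf(term, tokens) * idf(term, corpus) for term in vocab}
    for name, tokens in docs.items()
}

# Print only the non-zero scores of each document.
for name, row in scores.items():
    print(name, {term: round(value, 3) for term, value in row.items() if value > 0})
```

Terms shared by all three documents (我, 喜欢, 看) score zero everywhere, so the only terms left to distinguish the documents are 电影, 不, and 书.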

4. Code

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> corpus = ['this is the first document',
...           'this document is the second document',
...           'and this is the third one',
...           'is this the first document']
>>> # Fix the vocabulary so the column order of the count matrix is known.
>>> vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
...               'and', 'one']
>>> # Step 1 counts terms; step 2 rescales the counts to TF-IDF weights.
>>> pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
...                  ('tfid', TfidfTransformer())]).fit(corpus)
>>> # Raw term counts: one row per document, one column per vocabulary term.
>>> pipe['count'].transform(corpus).toarray()
array([[1, 1, 1, 1, 0, 1, 0, 0],
       [1, 2, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 0, 0]])
>>> # Learned IDF weight of each vocabulary term.
>>> pipe['tfid'].idf_
array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.        , 1.91629073, 1.91629073])
>>> # TF-IDF matrix: 4 documents by 8 vocabulary terms.
>>> pipe.transform(corpus).shape
(4, 8)
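
The `idf_` values printed above are not plain log(N/df) values: with its default `smooth_idf=True`, `TfidfTransformer` computes ln((1 + n) / (1 + df)) + 1, which is why terms that occur in every document get 1.0 rather than 0. The standard-library sketch below recomputes those numbers by hand. The whitespace tokenization is an assumption that happens to match `CountVectorizer` on this corpus, and `smoothed_idf` is a name chosen for this example.

```python
import math

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
              'and', 'one']

# Whitespace tokenization (an assumption; adequate for this small corpus).
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


idf_values = [round(smoothed_idf(term), 8) for term in vocabulary]
print(idf_values)
```

The smoothing behaves as if one extra document containing every term had been added to the corpus, which prevents division by zero, and the added 1 keeps terms that occur in every document from being ignored entirely.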

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html