TF-IDF (Term Frequency-Inverse Document Frequency) Explained: Theory and Python Implementation

TF-IDF Algorithm Explained: Understanding and Applications

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used algorithm in information retrieval and text mining, widely applied in search engines, recommendation systems, and various other text analysis fields. The core idea behind TF-IDF is to score the importance of each term within a document, which helps characterize the topic of the text and supports tasks such as automatic text classification and recommendation.

1. Definition of TF-IDF

TF-IDF consists of two components: TF (Term Frequency) and IDF (Inverse Document Frequency). Together, they reflect the importance of a term in a document.

  • TF (Term Frequency): This measures how frequently a term appears in a document. The formula is as follows:

    $$\text{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}$$

    Here, $t$ denotes the term and $d$ the document. Term frequency measures the importance of a word within a specific document: the more often a term appears in a document, the more significant it is for that document.

  • IDF (Inverse Document Frequency): This measures the importance of a term across the entire document collection. The formula is as follows:

    $$\text{IDF}(t, D) = \log\left(\frac{N}{\text{number of documents containing term } t}\right)$$

    Where $N$ is the total number of documents in the collection. The more documents that contain the term $t$, the lower the IDF value. The role of IDF is to penalize terms that appear frequently across the entire collection, because words that appear almost everywhere (like "the," "is," "and") contribute little to distinguishing between documents.

  • TF-IDF Value: The TF-IDF value is the product of TF and IDF, which represents the combined importance of a term in a document:

    $$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

    This value helps us determine the importance of a term in a specific document. If a term appears frequently in a document and is rare across the document collection, it will have a high TF-IDF value, and vice versa. A short worked example follows.
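
As a quick sanity check, here is a toy calculation (the numbers are invented for illustration, not taken from the original text). Suppose a 100-word document contains the term "engine" 5 times, and 10 of the 1,000 documents in the collection contain "engine". Using the natural logarithm:

$$\text{TF} = \frac{5}{100} = 0.05, \qquad \text{IDF} = \ln\left(\frac{1000}{10}\right) \approx 4.61, \qquad \text{TF-IDF} \approx 0.05 \times 4.61 \approx 0.23$$

The base of the logarithm differs between implementations; changing it rescales every IDF value by the same factor and does not affect the ranking of terms.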

2. Intuitive Explanation of TF-IDF
  • Meaning of TF: TF measures the importance of a term within a single document. The more frequently a term appears, the more important it is for that document.

  • Meaning of IDF: IDF penalizes terms that appear across many documents, because such terms (like "of," "the," "in," etc.) are not helpful in distinguishing between documents. Applying IDF decreases the weight of these common words and increases the weight of terms that are rare across the collection but frequent within particular documents.

  • Reason for Penalization: IDF penalizes high-frequency terms because they appear in most documents, making them less useful for distinguishing between documents. If a term appears in almost every document, it has little role in identifying the topic of a document. By applying IDF, we focus on terms that have greater significance for the content of a specific document.

3. Applications of TF-IDF

TF-IDF is widely used in various fields, especially in large companies and technology products. Here are some typical applications:

  • Search Engines: Search engines (such as Google and Bing) use TF-IDF to match user query terms with webpage content, helping to return the most relevant search results. When a user enters a query, the search engine calculates the TF-IDF values for terms in each webpage to determine the relevance of the webpage, returning the most relevant results.

  • Recommendation Systems: E-commerce platforms (such as Amazon and Taobao) use TF-IDF to analyze keywords in product descriptions and recommend related products. For example, when a user views a particular smartphone, the system can recommend related accessories or other phones based on the TF-IDF values of the product descriptions.

  • Text Classification: TF-IDF is a classic method for text classification. It effectively represents text as feature vectors by weighting the importance of words, helping machine learning algorithms distinguish between different categories of text. Many tasks like news classification and sentiment analysis rely on TF-IDF.

  • Spam Email Filtering: Email services use TF-IDF to analyze the content of emails when deciding whether a message is spam. Spam tends to contain characteristic terms (e.g., "free," "prize") whose TF-IDF weights form a distinctive profile that a classifier can learn, as the sketch after this list shows.
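
Here is a minimal sketch of this idea, assuming scikit-learn and a tiny invented training set (the emails and labels below are illustrative, not from the original text): TF-IDF converts each email into a weighted term vector, and a Naive Bayes classifier learns which weighted terms indicate spam.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: 1 = spam, 0 = not spam
emails = [
    "win a free prize now claim your reward",
    "limited offer free money click here",
    "meeting agenda for the quarterly review",
    "please review the attached project report",
]
labels = [1, 1, 0, 0]

# TF-IDF turns each email into a weighted term vector;
# Naive Bayes then learns which weighted terms indicate spam
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["claim your free prize now"]))     # likely [1]
print(model.predict(["agenda for the project review"]))  # likely [0]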

4. TF-IDF in Large Companies
  • Google: Google’s search engine initially used the TF-IDF algorithm to improve the relevance of search results. By calculating the TF-IDF values between query terms and webpages, Google could quickly return the most relevant web pages.

  • Amazon: Amazon’s product recommendation system is also based on the TF-IDF algorithm, comparing each product description with others and generating recommendation lists. This not only improves user experience but also increases sales.

  • Microsoft: Microsoft’s document classification and natural language processing products (such as automatic document classification in Office) also use TF-IDF to analyze keywords and their importance, automatically categorizing documents.

  • Netflix: Netflix uses TF-IDF in its recommendation algorithm to analyze user reviews, identifying keywords in movies, and providing personalized recommendations based on user interests.

5. Conclusion

TF-IDF is a simple yet efficient text analysis algorithm that, by combining term frequency and inverse document frequency, helps us extract the most representative terms from text. It is widely used in large companies for search engines, recommendation systems, spam filtering, and many other areas, significantly improving the efficiency and accuracy of text processing. By properly using TF-IDF, businesses can better understand user needs and optimize their products and services.

Python Example for TF-IDF Algorithm

To see the TF-IDF algorithm in action, and to illustrate how an early search engine such as Google could use TF-IDF to improve search result relevance, let us work through a practical Python example. Suppose we have some simple webpage contents and a query; we use TF-IDF values to determine which webpage is most relevant to the query.

1. Install Necessary Libraries

We can use TfidfVectorizer from sklearn to compute the TF-IDF values and perform simple similarity calculations to judge the relevance of a query to webpages. First, you need to install scikit-learn:

pip install scikit-learn

2. Implementation Code

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume we have content from three webpages
documents = [
    "Google is a search engine that helps you find websites.",
    "Google also provides email services through Gmail.",
    "Amazon is an online store that sells various products."
]

# The query (e.g., what the user is searching for)
query = ["search engine and websites"]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Combine the documents and query into a list to calculate TF-IDF together
all_documents = documents + query

# Compute the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(all_documents)

# Calculate cosine similarity between the query and each document
cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

# Output the similarity score between the query and each document
for i, score in enumerate(cosine_similarities[0]):
    print(f"Document {i+1} similarity: {score:.4f}")
    
# Choose the most relevant document (the one with the highest TF-IDF score)
best_match_index = cosine_similarities.argmax()
print(f"The most relevant document is Document {best_match_index + 1}")

3. Code Explanation

  • Documents: We have three simple webpages with different content. From these webpages, we want to find the most relevant one.

  • Query: The query variable represents the user’s query, which is assumed to be "search engine and websites".

  • TF-IDF Calculation: We use TfidfVectorizer to compute the TF-IDF values. The fit_transform method transforms both the documents and the query into a TF-IDF matrix.

  • Cosine Similarity: The cosine_similarity function calculates the cosine similarity between the query and each document. Cosine similarity measures how similar the directions of two vectors are; the closer the value is to 1, the more similar the vectors, meaning the document is more relevant to the query (the formula is given after this list).

  • Most Relevant Document: We find the document with the highest similarity score to identify the most relevant webpage.
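
For reference, the cosine similarity between two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

Because TF-IDF weights are non-negative, the similarity scores in this example fall in the range [0, 1].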

4. Running the Code

Running the code produces output like the following (exact values may vary slightly across scikit-learn versions):

Document 1 similarity: 0.4247
Document 2 similarity: 0.0000
Document 3 similarity: 0.0000
The most relevant document is Document 1
Explanation of Results:
  • Document 1 similarity: Document 1 shares the query terms "search", "engine", and "websites", giving a similarity of about 0.42.
  • Document 2 similarity: Document 2 contains none of the query terms, so its similarity is 0.0000.
  • Document 3 similarity: Document 3 also contains none of the query terms, so its similarity is 0.0000 (completely irrelevant).

In the end, the code determines that Document 1 (the webpage describing Google as a search engine) is the most relevant to the query, because it has the highest TF-IDF cosine similarity.

5. Practical Application

In real-world applications, this method can be extended to a large number of webpages and user queries. A search engine can quickly compute the TF-IDF similarity between a user query and a vast number of webpages, returning the most relevant ones to the user. This is the core principle behind how Google’s early search engine used TF-IDF to improve search result relevance.

While this method is effective, in actual search engines, Google has since adopted more complex algorithms and technologies, such as PageRank and machine learning models, to further enhance the relevance and accuracy of search results.

A Complete TF-IDF Algorithm Implementation from Scratch in Python

Here is a full example of how to implement the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm from scratch, covering the calculation of TF (Term Frequency), IDF (Inverse Document Frequency), and the resulting TF-IDF.

1. Data Preparation

We use some simple documents to simulate a small document set (e.g., web page content). These documents will be used to calculate the TF-IDF values.

2. Python Code Implementation

import math
from collections import Counter

# Calculate Term Frequency (TF)
def compute_tf(document):
    tf = {}
    word_count = len(document)
    word_frequency = Counter(document)
    
    for word, count in word_frequency.items():
        tf[word] = count / word_count
    return tf

# Calculate Inverse Document Frequency (IDF)
def compute_idf(documents):
    idf = {}
    total_documents = len(documents)
    
    # For each document, calculate the frequency of words
    for document in documents:
        for word in set(document):  # Use set to avoid counting the same word multiple times
            if word not in idf:
                # Calculate the number of documents containing the word
                doc_containing_word = sum(1 for doc in documents if word in doc)
                idf[word] = math.log(total_documents / doc_containing_word)
    
    return idf

# Calculate TF-IDF
def compute_tfidf(documents):
    tfidf = []
    # Calculate IDF
    idf = compute_idf(documents)
    
    for document in documents:
        tf = compute_tf(document)
        tfidf_document = {}
        
        for word in document:
            tfidf_document[word] = tf[word] * idf.get(word, 0)  # Calculate TF-IDF value
        
        tfidf.append(tfidf_document)
    
    return tfidf

# Example document set
documents = [
    "google is a search engine".split(),
    "google provides various services".split(),
    "amazon is an online store".split()
]

# Calculate TF-IDF values for each document
tfidf_results = compute_tfidf(documents)

# Output TF-IDF results for each document
for i, tfidf in enumerate(tfidf_results):
    print(f"Document {i+1} TF-IDF:")
    for word, score in tfidf.items():
        print(f"  {word}: {score:.4f}")
    print()

3. Code Explanation

  • Calculating TF:
    The compute_tf function calculates the term frequency (TF) for each word in a document. TF is the number of times a word appears in the document divided by the total number of words in the document.

    tf[word] = count / word_count
    
  • Calculating IDF:
    The compute_idf function calculates the inverse document frequency (IDF) for each word in the entire document set. IDF is calculated by the formula:

    $$\text{IDF}(t, D) = \log\left(\frac{N}{\text{number of documents containing the word } t}\right)$$

    Where $N$ is the total number of documents; the more documents contain the word $t$, the lower its IDF value.

  • Calculating TF-IDF:
    The compute_tfidf function combines the TF and IDF to calculate the TF-IDF for each word in a document. The formula is:

    $$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

    By multiplying the term frequency (TF) of the document by the inverse document frequency (IDF) of each word, we obtain the TF-IDF values for each word.

4. Example Output

Running the code produces the following output (values rounded to four decimal places):

Document 1 TF-IDF:
  google: 0.0811
  is: 0.0811
  a: 0.2197
  search: 0.2197
  engine: 0.2197

Document 2 TF-IDF:
  google: 0.1014
  provides: 0.2747
  various: 0.2747
  services: 0.2747

Document 3 TF-IDF:
  amazon: 0.2197
  is: 0.0811
  an: 0.2197
  online: 0.2197
  store: 0.2197
Output Explanation:
  • TF-IDF values: A higher TF-IDF value means the word is more significant for that document. For example, "google" appears in two of the three documents, so its IDF is log(3/2) ≈ 0.4055, much lower than the log(3) ≈ 1.0986 assigned to words that occur in only one document; in Document 1 its TF-IDF is (1/5) × 0.4055 ≈ 0.0811. A word's IDF would be exactly 0 only if it appeared in every document.
  • Combining TF and IDF: By combining TF and IDF, we can assess the importance of each word in the context of a particular document. Words that appear frequently in one document but are rare across the others receive a higher TF-IDF score.

5. Extensions

This implementation is a simple example, and there are several ways to extend it:

  • Handling larger datasets: This implementation works for small datasets. For larger datasets, optimizations like parallel computing or more efficient data structures may be necessary.
  • Removing stopwords: To improve the quality of TF-IDF calculations, you can remove common stopwords (e.g., “is”, “the”, “and”) from the text.
  • Other text preprocessing: You could add preprocessing steps like lowercasing, stemming, or lemmatization to make the scores more robust; a minimal sketch of these extensions follows.
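
A minimal sketch of the stopword and preprocessing extensions, reusing the compute_tfidf function defined above (the stopword list here is hand-picked for illustration, not a standard one):

# Hand-picked stopword list for illustration
STOPWORDS = {"is", "a", "an", "the", "and"}

def preprocess(text):
    # Lowercase, split on whitespace, and drop stopwords
    return [word for word in text.lower().split() if word not in STOPWORDS]

raw_documents = [
    "Google is a search engine",
    "Google provides various services",
    "Amazon is an online store",
]

# Tokenize and clean, then score with the from-scratch implementation
tfidf_results = compute_tfidf([preprocess(doc) for doc in raw_documents])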

This basic implementation provides a good starting point for understanding how TF-IDF works and can be adapted for more complex applications.

Postscript

Written in Shanghai at 16:34 on December 27, 2024, with the assistance of the GPT-4o mini model.
