Text Processing and Computing Similarity

news2025/4/12 12:37:11

Standard procedure: first, read in the documents.

## 1. Tokenize and clean the keywords
#     # Remove special characters
#     PATTERN = r'[?|$|&|*|%|@|(|)|~]'
#     text = re.sub(PATTERN, r'', text)


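# string manipulation libs
import re
import nltk
from nltk.corpus import stopwords

# One-time downloads of the NLTK resources used below
# (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

# Build the stopword set once instead of reloading the list for every token
STOPWORDS = set(stopwords.words("english"))


def cleaning_text(text: str) -> str:
    # remove special chars and numbers
    text = re.sub("[^A-Za-z]+", " ", text)
    # remove stopwords
    # 1. tokenize
    tokens = nltk.word_tokenize(text)
    # 2. drop stopwords
    tokens = [w for w in tokens if w.lower() not in STOPWORDS]
    # 3. join back together
    text = " ".join(tokens)
    # return text in lower case and stripped of whitespaces
    text = text.lower().strip()
    return text


# df_wd is assumed to be the DataFrame read in the first step;
# its 'words' column holds the raw text
df_wd['cleaned'] = df_wd['words'].apply(cleaning_text)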

Sparse matrix (TF-IDF)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=0.95)
# fit_transform applies TF-IDF to clean texts - we save the array of vectors in X
X = vectorizer.fit_transform( df_wd['cleaned'].tolist() )
vectorizer.get_feature_names_out()


Comparing similarity measures:

Cosine similarity


# Let's import text feature extraction TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
docs=['I love pets.','I hate pets.']

# Initialize TfidfVectorizer object
tfidf= TfidfVectorizer()

# Fit and transform the given data
tfidf_vector = tfidf.fit_transform(docs)

# Import cosine_similarity metrics
from sklearn.metrics.pairwise import cosine_similarity

# compute similarity using cosine similarity
cos_sim=cosine_similarity(tfidf_vector, tfidf_vector)

print(cos_sim)

Cosine similarity requires a vector representation of the text (for example word vectors or TF-IDF vectors).
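
As a rough illustration of why vectors are needed, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it by hand with NumPy and compares it with scikit-learn's result; the helper `cosine_manual` and the variable names are hypothetical and used only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sample_docs = ["I love pets.", "I hate pets."]

# Dense TF-IDF vectors, one row per document
sample_vectors = TfidfVectorizer().fit_transform(sample_docs).toarray()


def cosine_manual(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


manual_score = cosine_manual(sample_vectors[0], sample_vectors[1])
library_score = float(cosine_similarity(sample_vectors)[0, 1])

print(manual_score, library_score)  # the two values should agree
```

Note that the default tokenizer of `TfidfVectorizer` ignores single-character tokens such as "I", so only "love", "hate", and "pets" contribute to the vectors here.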

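Jaccard similarity is computed from the set of words the two texts have in common, so it is affected by surface-form differences such as singular versus plural.

Computing the similarity of two sentences directly: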

def jaccard_similarity(sent1, sent2):
    """Find text similarity using jaccard similarity"""
    # Tokenize sentences
    token1 = set(sent1.split())
    token2 = set(sent2.split())
     
    # intersection between tokens of two sentences    
    intersection_tokens = token1.intersection(token2)
    
    # Union between tokens of two sentences
    union_tokens=token1.union(token2)
    
    # Jaccard similarity: |intersection| / |union|
    sim_= float(len(intersection_tokens) / len(union_tokens))
    return sim_

jaccard_similarity('I love pets.','I hate pets.')
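
One possible way to soften the singular/plural problem is to normalize tokens before comparing the sets. The sketch below lowercases the text, keeps only alphabetic tokens, and stems them with NLTK's `PorterStemmer`; the choice of stemmer and the helper names `normalized_tokens`, `jaccard_raw`, and `jaccard_normalized` are assumptions made for illustration, not part of the original approach.

```python
import re
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()


def normalized_tokens(sentence):
    """Lowercase, keep alphabetic tokens only, and stem each token."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {_stemmer.stem(w) for w in words}


def jaccard_raw(sent1, sent2):
    """Jaccard similarity on whitespace-split tokens, no normalization."""
    t1, t2 = set(sent1.split()), set(sent2.split())
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0


def jaccard_normalized(sent1, sent2):
    """Jaccard similarity on stemmed, punctuation-free tokens."""
    t1, t2 = normalized_tokens(sent1), normalized_tokens(sent2)
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0


a, b = "I love my pet.", "I love pets!"
print(jaccard_raw(a, b))         # "pet." and "pets!" count as different tokens
print(jaccard_normalized(a, b))  # both reduce to the same stem
```

Stemming is a blunt tool: it can also merge words that should stay distinct, so a lemmatizer may be a better fit when precision matters.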

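That said, Jaccard similarity is rarely used when working with text data, because it does not work with text embeddings. It is therefore limited to assessing lexical similarity, that is, how similar documents are at the word level.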

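As for the cosine and Euclidean metrics, the distinguishing factor is that cosine similarity is not affected by the magnitude (length) of the feature vectors. Suppose we are building a topic-tagging algorithm. If a word (e.g., senate) occurs more often in document 1 than in document 2, we might assume that document 1 is more related to the topic of politics. However, we may also be dealing with news articles of different lengths. In that case, "senate" may occur more often in document 1 simply because the document is longer. As we saw earlier when repeating the word "empty", cosine similarity is less sensitive to differences in length.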
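
The length argument can be checked with a small numeric sketch. The term counts below are invented for illustration: the "long" document has exactly the same word proportions as the short one but is five times longer.

```python
import numpy as np

# Hypothetical term counts over the vocabulary ["senate", "vote", "bill"]
short_doc = np.array([2.0, 1.0, 1.0])
long_doc = 5 * short_doc  # same proportions, five times as many words


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


cos_score = cosine(short_doc, long_doc)
euclid_dist = float(np.linalg.norm(short_doc - long_doc))

print("cosine similarity:", cos_score)      # ~1.0: same direction
print("euclidean distance:", euclid_dist)   # large: driven by length alone
```

Cosine similarity treats the two documents as identical in content, while the Euclidean distance grows with the difference in length even though the topic mix is unchanged.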

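Beyond that, Euclidean distance does not work well with the sparse vectors of text embeddings. For this reason, cosine similarity is generally preferred over Euclidean distance when working with text data. The only length-sensitive text similarity use case that comes to mind is plagiarism detection.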

The most important reference links for understanding the topic:

Text normalization as key preprocessing
https://wiki.shileizcc.com/confluence/pages/viewpage.action?pageId=42533117
Text similarity
https://subscription.packtpub.com/book/data/9781789955248/16/ch16lvl1sec65/text-similarity
itheima text processing
https://book.itheima.net/course/221/1270308811000782849/1271374300858818562
Learning NLP
https://github.com/jevy146/66Days__NaturalLanguageProcessing
A fairly complete treatment of English sentence similarity
https://www.cnblogs.com/infaraway/p/8666269.html