解锁搜索新境界!让文本语义匹配助你轻松找到你需要的一切!(快速上手baseline)
实现了多种相似度计算、匹配搜索算法,支持文本、图像,python3开发,pip安装,开箱即用。
-
文本相似度计算(文本匹配)
- 余弦相似(Cosine Similarity):两向量求余弦
- 点积(Dot Product):两向量归一化后求内积
- 汉明距离(Hamming Distance),编辑距离(Levenshtein Distance),欧氏距离(Euclidean Distance),曼哈顿距离(Manhattan Distance)等
-
语义模型
- [CoSENT文本匹配模型]【推荐】
- BERT模型(文本向量表征)
- SentenceBERT文本匹配模型
-
字面模型
- [Word2Vec文本浅层语义表征]【推荐】
- 同义词词林
- 知网Hownet义原匹配
- BM25、RankBM25
- TFIDF
- SimHash
-
图像相似度计算(图像匹配)
- 语义模型
- [CLIP(Contrastive Language-Image Pre-Training)]
- VGG(doing)
- ResNet(doing)
- 语义模型
-
特征提取
- [pHash]【推荐】, dHash, wHash, aHash
- SIFT, Scale Invariant Feature Transform(SIFT)
- SURF, Speeded Up Robust Features(SURF)(doing)
-
图文相似度计算
- [CLIP(Contrastive Language-Image Pre-Training)]
-
匹配搜索
- [SemanticSearch]:向量相似检索,使用Cosine
Similarty + topk高效计算,比一对一暴力计算快一个数量级
- [SemanticSearch]:向量相似检索,使用Cosine
项目链接见文末
环境安装:
!pip install --upgrade pip -i https://mirrors.cloud.tencent.com/pypi/simple
!pip install -U similarities -i https://mirrors.cloud.tencent.com/pypi/simple
#安装依赖库
!pip install -r /home/mw/project/similarities-main/requirements.txt -i https://mirrors.cloud.tencent.com/pypi/simple
#安装高版本torch,安装完后重启内核
!pip install torch==1.12.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu -i https://mirrors.cloud.tencent.com/pypi/simple
1.1 文本语义相似度计算
from similarities import Similarity
#模型文件在input路径下,有中文和多语言两个模型,可自行选择
m = Similarity(model_name_or_path="/home/mw/input/99556126636/zh_model/中文模型",max_seq_length=128)
r = m.similarity('今天的天气不错是晴天', '今天天气很好阳光明媚')
print(f"similarity score: {float(r)}")
2023-09-11 02:36:44.046 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
similarity score: 0.7727918028831482
Similarity的默认方法:
Similarity(corpus: Union[List[str], Dict[str, str]] = None,
model_name_or_path="shibing624/text2vec-base-chinese",
max_seq_length=128)
- 返回值:余弦值
score
范围是[-1, 1],值越大越相似 corpus
:搜索用的doc集,仅搜索时需要,输入格式:句子列表List[str]
或者{corpus_id: sentence}的Dict[str, str]
格式model_name_or_path
:模型名称或者模型路径,默认会从HF model hub下载并使用中文语义匹配模型shibing624/text2vec-base-chinese,如果是多语言景,可以替换为多语言匹配模型shibing624/text2vec-base-multilingualmax_seq_length
:输入句子的最大长度,最大为匹配模型支持的最大长度,BERT系列是512
1.2. 文本语义匹配搜索
一般在文档候选集中找与query最相似的文本,常用于QA场景的问句相似匹配、文本相似检索等任务。
from similarities import Similarity
# 1.Compute cosine similarity between two sentences.
sentences = ['今天的天气不错是晴天',
'今天天气很好阳光明媚']
corpus = [
'今天天气很好阳光明媚',
'在好天气里,我喜欢去散步',
'这本书太无聊了,我无法读下去',
'我喜欢去海边度假,感受阳光和海风',
'在旅行途中,我们遇到了许多有趣的人',
'我喜欢随时随地用耳机听音乐',
]
model = Similarity(model_name_or_path="/home/mw/input/99556126636/zh_model/中文模型")
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
print('-' * 50 + '\n')
# 2.Compute similarity between two list
similarity_scores = model.similarity(sentences, corpus)
print(similarity_scores.numpy())
for i in range(len(sentences)):
for j in range(len(corpus)):
print(f"{sentences[i]} vs {corpus[j]}, score: {similarity_scores.numpy()[i][j]:.4f}")
print('-' * 50 + '\n')
# 3.Semantic Search
model.add_corpus(corpus)
res = model.most_similar(queries=sentences, topn=3)
print(res)
for q_id, c in res.items():
print('query:', sentences[q_id])
print("search top 3:")
for corpus_id, s in c.items():
print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
2023-09-11 02:43:19.744 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
Similarity: Similarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>
今天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
--------------------------------------------------
2023-09-11 02:43:22.348 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
[[0.77279186 0.377486 0.2831661 0.3328314 0.33157927 0.271398 ]
[1. 0.4531002 0.22196919 0.42843264 0.31628954 0.28194088]]
今天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
今天的天气不错是晴天 vs 在好天气里,我喜欢去散步, score: 0.3775
今天的天气不错是晴天 vs 这本书太无聊了,我无法读下去, score: 0.2832
今天的天气不错是晴天 vs 我喜欢去海边度假,感受阳光和海风, score: 0.3328
今天的天气不错是晴天 vs 在旅行途中,我们遇到了许多有趣的人, score: 0.3316
今天的天气不错是晴天 vs 我喜欢随时随地用耳机听音乐, score: 0.2714
今天天气很好阳光明媚 vs 今天天气很好阳光明媚, score: 1.0000
今天天气很好阳光明媚 vs 在好天气里,我喜欢去散步, score: 0.4531
今天天气很好阳光明媚 vs 这本书太无聊了,我无法读下去, score: 0.2220
今天天气很好阳光明媚 vs 我喜欢去海边度假,感受阳光和海风, score: 0.4284
今天天气很好阳光明媚 vs 在旅行途中,我们遇到了许多有趣的人, score: 0.3163
今天天气很好阳光明媚 vs 我喜欢随时随地用耳机听音乐, score: 0.2819
--------------------------------------------------
2023-09-11 02:43:23.945 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 6, emb len: 6
{0: {0: 0.772791862487793, 1: 0.377485990524292, 3: 0.33283141255378723}, 1: {0: 1.0, 1: 0.45310020446777344, 3: 0.4284326434135437}}
query: 今天的天气不错是晴天
search top 3:
今天天气很好阳光明媚: 0.7728
在好天气里,我喜欢去散步: 0.3775
我喜欢去海边度假,感受阳光和海风: 0.3328
query: 今天天气很好阳光明媚
search top 3:
今天天气很好阳光明媚: 1.0000
在好天气里,我喜欢去散步: 0.4531
我喜欢去海边度假,感受阳光和海风: 0.4284
余弦
score
的值范围[-1, 1],值越大,表示该query与corpus的文本越相似。
1.3 多语言文本语义相似度计算和匹配搜索
多语言:包括中、英、韩、日、德、意等多国语言
from similarities import Similarity
# Two lists of sentences
sentences1 = [
'The cat sits outside',
'A man is playing guitar',
'The new movie is awesome',
'花呗更改绑定银行卡',
'The quick brown fox jumps over the lazy dog.',
]
sentences2 = [
'The dog plays in the garden',
'A woman watches TV',
'The new movie is so great',
'如何更换花呗绑定银行卡',
'敏捷的棕色狐狸跳过了懒狗',
]
model = Similarity(model_name_or_path="/home/mw/input/99556126636/mul_model/多语言模型")
# 使用的是多语言文本匹配模型
scores = model.similarity(sentences1, sentences2)
print('1:use Similarity compute cos scores\n')
for i in range(len(sentences1)):
for j in range(len(sentences2)):
print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], scores[i][j]))
print('-' * 50 + '\n')
print('2:search\n')
# 2.Semantic Search
corpus = [
'The cat sits outside',
'A man is playing guitar',
'I love pasta',
'The new movie is awesome',
'The cat plays in the garden',
'A woman watches TV',
'The new movie is so great',
'Do you like pizza?',
'如何更换花呗绑定银行卡',
'敏捷的棕色狐狸跳过了懒狗',
'猫在窗外',
'电影很棒',
]
model.add_corpus(corpus)
model.save_index('en_corpus_emb.json')
res = model.most_similar(queries=sentences1, topn=3)
print(res)
del model
model = Similarity(model_name_or_path="/home/mw/input/99556126636/mul_model/多语言模型")
model.load_index('en_corpus_emb.json')
res = model.most_similar(queries=sentences1, topn=3)
print(res)
for q_id, c in res.items():
print('query:', sentences1[q_id])
print("search top 3:")
for corpus_id, s in c.items():
print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
2023-09-11 02:46:32.262 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 02:46:33.748 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 12
1:use Similarity compute cos scores
The cat sits outside The dog plays in the garden Score: 0.6211
The cat sits outside A woman watches TV Score: 0.4926
The cat sits outside The new movie is so great Score: 0.5312
The cat sits outside 如何更换花呗绑定银行卡 Score: 0.4604
The cat sits outside 敏捷的棕色狐狸跳过了懒狗 Score: 0.4951
A man is playing guitar The dog plays in the garden Score: 0.6483
A man is playing guitar A woman watches TV Score: 0.5747
A man is playing guitar The new movie is so great Score: 0.5524
A man is playing guitar 如何更换花呗绑定银行卡 Score: 0.5098
A man is playing guitar 敏捷的棕色狐狸跳过了懒狗 Score: 0.5210
The new movie is awesome The dog plays in the garden Score: 0.5940
The new movie is awesome A woman watches TV Score: 0.5510
The new movie is awesome The new movie is so great Score: 0.9822
The new movie is awesome 如何更换花呗绑定银行卡 Score: 0.4767
The new movie is awesome 敏捷的棕色狐狸跳过了懒狗 Score: 0.5523
花呗更改绑定银行卡 The dog plays in the garden Score: 0.4788
花呗更改绑定银行卡 A woman watches TV Score: 0.3842
花呗更改绑定银行卡 The new movie is so great Score: 0.4845
花呗更改绑定银行卡 如何更换花呗绑定银行卡 Score: 0.9377
花呗更改绑定银行卡 敏捷的棕色狐狸跳过了懒狗 Score: 0.4546
The quick brown fox jumps over the lazy dog. The dog plays in the garden Score: 0.7547
The quick brown fox jumps over the lazy dog. A woman watches TV Score: 0.4952
The quick brown fox jumps over the lazy dog. The new movie is so great Score: 0.5761
The quick brown fox jumps over the lazy dog. 如何更换花呗绑定银行卡 Score: 0.4426
The quick brown fox jumps over the lazy dog. 敏捷的棕色狐狸跳过了懒狗 Score: 0.9290
--------------------------------------------------
2:search
2023-09-11 02:46:34.448 | INFO | similarities.similarity:add_corpus:155 - Add 12 docs, total: 12, emb len: 12
2023-09-11 02:46:34.468 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: en_corpus_emb.json.
{0: {0: 0.9999998807907104, 10: 0.819859504699707, 4: 0.8006516695022583}, 1: {1: 1.0000001192092896, 4: 0.5819121599197388, 5: 0.5746968388557434}, 2: {3: 1.0, 6: 0.982224702835083, 11: 0.8939364552497864}, 3: {8: 0.9376938343048096, 1: 0.5211056470870972, 0: 0.49192243814468384}, 4: {9: 0.9290249943733215, 4: 0.657951831817627, 10: 0.6018596887588501}}
2023-09-11 02:46:37.260 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
{0: {0: 0.9999998807907104, 10: 0.819859504699707, 4: 0.8006516695022583}, 1: {1: 1.0000001192092896, 4: 0.5819121599197388, 5: 0.5746968388557434}, 2: {3: 1.0, 6: 0.982224702835083, 11: 0.8939364552497864}, 3: {8: 0.9376938343048096, 1: 0.5211056470870972, 0: 0.49192243814468384}, 4: {9: 0.9290249943733215, 4: 0.657951831817627, 10: 0.6018596887588501}}
query: The cat sits outside
search top 3:
The cat sits outside: 1.0000
猫在窗外: 0.8199
The cat plays in the garden: 0.8007
query: A man is playing guitar
search top 3:
A man is playing guitar: 1.0000
The cat plays in the garden: 0.5819
A woman watches TV: 0.5747
query: The new movie is awesome
search top 3:
The new movie is awesome: 1.0000
The new movie is so great: 0.9822
电影很棒: 0.8939
query: 花呗更改绑定银行卡
search top 3:
如何更换花呗绑定银行卡: 0.9377
A man is playing guitar: 0.5211
The cat sits outside: 0.4919
query: The quick brown fox jumps over the lazy dog.
search top 3:
敏捷的棕色狐狸跳过了懒狗: 0.9290
The cat plays in the garden: 0.6580
猫在窗外: 0.6019
1.4. 快速近似文本语义匹配搜索(Annoy和Hnswlib:百万数据集)
支持Annoy、Hnswlib的近似语义匹配搜索,常用于百万数据集的匹配搜索任务。
上亿级别的可以使用Milvus向量数据库,检索很快,下面样例是在1000w数据集下测试
- 效果预览:
性能对比:
硬件配置 | 向量库数据量 | 提取特征所需时间 | milvus检索所需时间 | 排序所需时间 | 总耗时 |
---|---|---|---|---|---|
CPU 12核 2.5GHz | 1000w 大小15GB左右 | 64.5ms | 258.3ms | 871.6 ms | 1.19s |
CPU + Tesla V100 32G | 1000w 大小15GB左右 | 10ms | 213.6ms | 24.1ms | 0.25s |
- 项目专栏链接:欢迎fork
基于Milvus+ERNIE+SimCSE+In-batch Negatives样本策略的学术文献语义检索系统
下面展示demo:
Annoy和Hnswlib是两个常用的近似语义匹配搜索库,它们都可以用于高效地搜索最接近给定向量的邻居。
-
Annoy(Approximate Nearest Neighbors Oh Yeah):
- Annoy 是一种基于树结构的近似最近邻算法,其中树被构建为一种特殊的二叉搜索树。它使用欧氏距离进行近似邻居搜索。
- Annoy 支持快速插入和更新数据,并且占用较少的内存空间。它的搜索速度快,尤其适用于高维向量的近似搜索。
- Annoy 可以用于各种任务,如推荐系统、图像和文本处理等。
- Annoy 的接口简单易用,可在多个编程语言中使用,如Python、C++等。
-
Hnswlib(Hierarchical Navigable Small World Library):
- Hnswlib 也是一种基于树结构的近似最近邻算法,它使用一种叫做 “层级可导航小世界” 的数据结构。通过构建多个层级的索引结构,它能够快速搜索最接近的邻居。
- Hnswlib 支持欧氏距离和角度距离等多种距离度量方式,使得它适用于不同的应用场景。
- Hnswlib 的索引结构可以在线更新,支持高效地添加和删除向量。
- Hnswlib 提供了多线程搜索功能,可以提高搜索速度,并且可以在大规模数据集上进行高效搜索。
- Hnswlib 在C++中实现,但也提供了Python的绑定接口。
综上所述,Annoy 是一种基于树结构的近似最近邻算法,适用于高维向量的近似搜索;而Hnswlib 是基于 “层级可导航小世界” 结构的近似最近邻算法,支持多种距离度量方式,并适用于大规模数据集的高效搜索
更多内容参考:
推荐系统[九]项目技术细节讲解z1:Elasticsearch 如何进行快速检索(ES倒排索引和分词原理)以及倒排索引在召回中的应用。
推荐系统[九]项目技术细节讲解z3:向量检索技术与ANN搜索算法[KD树、Annoy、LSH局部哈希、PQ乘积量化、IVFPQ倒排乘积量化、HNSW层级图搜索等],超级详细技术原理讲解
import os
import sys
#加载路径导入函数
sys.path.append("/home/mw/project/similarities-main/similarities")
from fastsim import AnnoySimilarity
from fastsim import HnswlibSimilarity
#需要注意,请修改/home/mw/project/similarities-main/similarities/fastsim.py 对应模型路径,修改为model_name_or_path="/home/mw/input/99556126636/zh_model/中文模型",
sentences = ['今天的天气不错是晴天',
'今天天气很好阳光明媚']
corpus = [
'今天天气很好阳光明媚',
'在好天气里,我喜欢去散步',
'这本书太无聊了,我无法读下去',
'我喜欢去海边度假,感受阳光和海风',
'在旅行途中,我们遇到了许多有趣的人',
'我喜欢随时随地用耳机听音乐',
]
def annoy_demo():
corpus_new = [i + str(id) for id, i in enumerate(corpus * 10)]
model = AnnoySimilarity(corpus=corpus_new)
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
model.add_corpus(corpus)
model.build_index()
model.save_index('annoy_model.bin')
print(model.most_similar("men喜欢这首歌"))
# Semantic Search batch
del model
model = AnnoySimilarity()
model.load_index('annoy_model.bin')
print(model.most_similar("men喜欢这首歌"))
queries = ["今天的天气不错是晴天", "men喜欢这首歌"]
res = model.most_similar(queries, topn=3)
print(res)
for q_id, c in res.items():
print('query:', queries[q_id])
print("search top 3:")
for corpus_id, s in c.items():
print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
# os.remove('annoy_model.bin')
print('-' * 50 + '\n')
def hnswlib_demo():
corpus_new = [i + str(id) for id, i in enumerate(corpus * 10)]
print(corpus_new)
model = HnswlibSimilarity(corpus=corpus_new)
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
model.add_corpus(corpus)
model.build_index()
model.save_index('hnsw_model.bin')
print(model.most_similar("men喜欢这首歌"))
# Semantic Search batch
del model
model = HnswlibSimilarity()
model.load_index('hnsw_model.bin')
print(model.most_similar("men喜欢这首歌"))
queries = ["今天的天气不错是晴天", "men喜欢这首歌"]
res = model.most_similar(queries, topn=3)
print(res)
for q_id, c in res.items():
print('query:', queries[q_id])
print("search top 3:")
for corpus_id, s in c.items():
print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
# os.remove('hnsw_model.bin')
print('-' * 50 + '\n')
if __name__ == '__main__':
annoy_demo()
hnswlib_demo()
2023-09-11 03:09:23.344 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:23.348 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 60
2023-09-11 03:09:28.151 | INFO | similarities.similarity:add_corpus:155 - Add 60 docs, total: 60, emb len: 60
2023-09-11 03:09:28.241 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
2023-09-11 03:09:28.242 | DEBUG | fastsim:build_index:53 - Building index with 256 trees.
Similarity: AnnoySimilarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>, corpus size: 60
2023-09-11 03:09:29.445 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
今天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
2023-09-11 03:09:30.544 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 66, emb len: 66
2023-09-11 03:09:30.544 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
2023-09-11 03:09:30.545 | DEBUG | fastsim:build_index:53 - Building index with 256 trees.
2023-09-11 03:09:30.669 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: annoy_model.bin.json.
2023-09-11 03:09:30.669 | INFO | fastsim:save_index:67 - Saving Annoy index to: annoy_model.bin, corpus embedding to: annoy_model.bin.json
{0: {59: 0.4495447165407924, 29: 0.44775851052770577, 5: 0.44683993510903264, 11: 0.44624594564543685, 35: 0.44612286870435014, 53: 0.44598170858875363, 41: 0.44574389155260974, 17: 0.44521147337774636, 23: 0.44469588530591864, 47: 0.444264516571927}}
2023-09-11 03:09:32.450 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:32.544 | INFO | fastsim:load_index:75 - Loading index from: annoy_model.bin, corpus embedding from: annoy_model.bin.json
2023-09-11 03:09:32.566 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
{0: {59: 0.4495447165407924, 29: 0.44775851052770577, 5: 0.44683993510903264, 11: 0.44624594564543685, 35: 0.44612286870435014, 53: 0.44598170858875363, 41: 0.44574389155260974, 17: 0.44521147337774636, 23: 0.44469588530591864, 47: 0.444264516571927}}
{0: {60: 0.7727918750653213, 6: 0.7462793254362339, 36: 0.7355303251593384}, 1: {59: 0.4495447165407924, 29: 0.44775863580996855, 5: 0.4468400604954468}}
query: 今天的天气不错是晴天
search top 3:
今天天气很好阳光明媚: 0.7728
今天天气很好阳光明媚6: 0.7463
今天天气很好阳光明媚36: 0.7355
query: men喜欢这首歌
search top 3:
我喜欢随时随地用耳机听音乐59: 0.4495
我喜欢随时随地用耳机听音乐29: 0.4478
我喜欢随时随地用耳机听音乐5: 0.4468
--------------------------------------------------
['今天天气很好阳光明媚0', '在好天气里,我喜欢去散步1', '这本书太无聊了,我无法读下去2', '我喜欢去海边度假,感受阳光和海风3', '在旅行途中,我们遇到了许多有趣的人4', '我喜欢随时随地用耳机听音乐5', '今天天气很好阳光明媚6', '在好天气里,我喜欢去散步7', '这本书太无聊了,我无法读下去8', '我喜欢去海边度假,感受阳光和海风9', '在旅行途中,我们遇到了许多有趣的人10', '我喜欢随时随地用耳机听音乐11', '今天天气很好阳光明媚12', '在好天气里,我喜欢去散步13', '这本书太无聊了,我无法读下去14', '我喜欢去海边度假,感受阳光和海风15', '在旅行途中,我们遇到了许多有趣的人16', '我喜欢随时随地用耳机听音乐17', '今天天气很好阳光明媚18', '在好天气里,我喜欢去散步19', '这本书太无聊了,我无法读下去20', '我喜欢去海边度假,感受阳光和海风21', '在旅行途中,我们遇到了许多有趣的人22', '我喜欢随时随地用耳机听音乐23', '今天天气很好阳光明媚24', '在好天气里,我喜欢去散步25', '这本书太无聊了,我无法读下去26', '我喜欢去海边度假,感受阳光和海风27', '在旅行途中,我们遇到了许多有趣的人28', '我喜欢随时随地用耳机听音乐29', '今天天气很好阳光明媚30', '在好天气里,我喜欢去散步31', '这本书太无聊了,我无法读下去32', '我喜欢去海边度假,感受阳光和海风33', '在旅行途中,我们遇到了许多有趣的人34', '我喜欢随时随地用耳机听音乐35', '今天天气很好阳光明媚36', '在好天气里,我喜欢去散步37', '这本书太无聊了,我无法读下去38', '我喜欢去海边度假,感受阳光和海风39', '在旅行途中,我们遇到了许多有趣的人40', '我喜欢随时随地用耳机听音乐41', '今天天气很好阳光明媚42', '在好天气里,我喜欢去散步43', '这本书太无聊了,我无法读下去44', '我喜欢去海边度假,感受阳光和海风45', '在旅行途中,我们遇到了许多有趣的人46', '我喜欢随时随地用耳机听音乐47', '今天天气很好阳光明媚48', '在好天气里,我喜欢去散步49', '这本书太无聊了,我无法读下去50', '我喜欢去海边度假,感受阳光和海风51', '在旅行途中,我们遇到了许多有趣的人52', '我喜欢随时随地用耳机听音乐53', '今天天气很好阳光明媚54', '在好天气里,我喜欢去散步55', '这本书太无聊了,我无法读下去56', '我喜欢去海边度假,感受阳光和海风57', '在旅行途中,我们遇到了许多有趣的人58', '我喜欢随时随地用耳机听音乐59']
2023-09-11 03:09:34.947 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:35.044 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 60
2023-09-11 03:09:40.349 | INFO | similarities.similarity:add_corpus:155 - Add 60 docs, total: 60, emb len: 60
2023-09-11 03:09:40.350 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
2023-09-11 03:09:40.351 | INFO | fastsim:build_index:156 - Building HNSWLIB index, max_elements: 60
2023-09-11 03:09:40.351 | DEBUG | fastsim:build_index:157 - Parameters Required: M: 64
2023-09-11 03:09:40.352 | DEBUG | fastsim:build_index:158 - Parameters Required: ef_construction: 400
2023-09-11 03:09:40.352 | DEBUG | fastsim:build_index:159 - Parameters Required: ef(>topn): 50
Similarity: HnswlibSimilarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>, corpus size: 60
2023-09-11 03:09:41.345 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
今天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
2023-09-11 03:09:42.548 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 66, emb len: 66
2023-09-11 03:09:42.550 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
2023-09-11 03:09:42.551 | INFO | fastsim:build_index:156 - Building HNSWLIB index, max_elements: 66
2023-09-11 03:09:42.551 | DEBUG | fastsim:build_index:157 - Parameters Required: M: 64
2023-09-11 03:09:42.552 | DEBUG | fastsim:build_index:158 - Parameters Required: ef_construction: 400
2023-09-11 03:09:42.552 | DEBUG | fastsim:build_index:159 - Parameters Required: ef(>topn): 50
2023-09-11 03:09:42.717 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: hnsw_model.bin.json.
2023-09-11 03:09:42.718 | INFO | fastsim:save_index:172 - Saving hnswlib index to: hnsw_model.bin, corpus embedding to: hnsw_model.bin.json
{0: {59: 0.44954460859298706, 29: 0.44775843620300293, 5: 0.44683992862701416, 11: 0.44624578952789307, 35: 0.4461227059364319, 53: 0.4459817409515381, 41: 0.4457439184188843, 17: 0.44521135091781616, 23: 0.4446955919265747, 47: 0.44426441192626953}}
2023-09-11 03:09:44.547 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:44.643 | INFO | fastsim:load_index:180 - Loading index from: hnsw_model.bin, corpus embedding from: hnsw_model.bin.json
2023-09-11 03:09:44.665 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
Warning: Calling load_index for an already inited index. Old index is being deallocated.
{0: {59: 0.44954460859298706, 29: 0.44775843620300293, 5: 0.44683992862701416, 11: 0.44624578952789307, 35: 0.4461227059364319, 53: 0.4459817409515381, 41: 0.4457439184188843, 17: 0.44521135091781616, 23: 0.4446955919265747, 47: 0.44426441192626953}}
{0: {60: 0.7727917432785034, 6: 0.746279239654541, 36: 0.7355299592018127}, 1: {59: 0.4495447278022766, 29: 0.4477585554122925, 5: 0.44683998823165894}}
query: 今天的天气不错是晴天
search top 3:
今天天气很好阳光明媚: 0.7728
今天天气很好阳光明媚6: 0.7463
今天天气很好阳光明媚36: 0.7355
query: men喜欢这首歌
search top 3:
我喜欢随时随地用耳机听音乐59: 0.4495
我喜欢随时随地用耳机听音乐29: 0.4478
我喜欢随时随地用耳机听音乐5: 0.4468
--------------------------------------------------
1.5 基于字面的文本相似度计算和匹配搜索
支持同义词词林(Cilin)、知网Hownet、词向量(WordEmbedding)、Tfidf、SimHash、BM25等算法的相似度计算和字面匹配搜索,常用于文本匹配冷启动。
–> 480 self.cilin_dict = self.load_cilin_dict(cilin_path) # Cilin(词林) semantic dictionary
481 self.corpus = {}
482
/opt/conda/lib/python3.7/site-packages/similarities/literalsim.py in load_cilin_dict(path)
522 “”“加载词林语义词典”“”
523 sem_dict = {}
–> 524 for line in open(path, ‘r’, encoding=‘utf-8’):
525 line = line.strip()
526 terms = line.split(’ ')
FileNotFoundError: [Errno 2] No such file or directory: ‘/opt/conda/lib/python3.7/site-packages/similarities/data/cilin.txt’
添加词库路径修改literalsim.py文件
default_cilin_path=‘/home/mw/project/similarities-main/similarities/data/cilin.txt’
import sys
from loguru import logger
sys.path.append('/home/mw/project/similarities-main')
sys.path.append('/home/mw/project/similarities-main/similarities')
from similarities import (
SimHashSimilarity,
TfidfSimilarity,
BM25Similarity,
WordEmbeddingSimilarity,
CilinSimilarity,
HownetSimilarity,
SameCharsSimilarity,
SequenceMatcherSimilarity,
)
logger.remove()
logger.add(sys.stderr, level="INFO")
def sim_and_search(m):
print(m)
if 'BM25' not in str(m):
sim_scores = m.similarity(text1, text2)
print('sim scores: ', sim_scores)
for (idx, i), j in zip(enumerate(text1), text2):
s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
print(f"{i} vs {j}, score: {s:.4f}")
m.add_corpus(corpus)
res = m.most_similar(queries, topn=3)
print('sim search: ', res)
for q_id, c in res.items():
print('query:', queries[q_id])
print("search top 3:")
for corpus_id, s in c.items():
print(f'\t{m.corpus[corpus_id]}: {s:.4f}')
print('-' * 50 + '\n')
if __name__ == '__main__':
text1 = [
'如何更换花呗绑定银行卡',
'花呗更改绑定银行卡'
]
text2 = [
'花呗更改绑定银行卡',
'我什么时候开通了花呗',
]
corpus = [
'花呗更改绑定银行卡',
'我什么时候开通了花呗',
'俄罗斯警告乌克兰反对欧盟协议',
'暴风雨掩埋了东北部;新泽西16英寸的降雪',
'中央情报局局长访问以色列叙利亚会谈',
'人在巴基斯坦基地的炸弹袭击中丧生',
]
queries = [
'我的花呗开通了?',
'乌克兰被俄罗斯警告',
'更改绑定银行卡',
]
print('text1: ', text1)
print('text2: ', text2)
print('query: ', queries)
sim_and_search(SimHashSimilarity())
sim_and_search(TfidfSimilarity())
sim_and_search(BM25Similarity())
sim_and_search(WordEmbeddingSimilarity())
# sim_and_search(CilinSimilarity()) #词库路径在/home/mw/project/similarities-main/similarities/data/下,自行添加
# sim_and_search(HownetSimilarity())
sim_and_search(SameCharsSimilarity())
sim_and_search(SequenceMatcherSimilarity())
2023-09-11 03:36:11.670 | INFO | similarities.literalsim:add_corpus:75 - Start computing corpus embeddings, new docs: 6
text1: ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
text2: ['花呗更改绑定银行卡', '我什么时候开通了花呗']
query: ['我的花呗开通了?', '乌克兰被俄罗斯警告', '更改绑定银行卡']
Similarity: SimHashSimilarity, matching_model: SimHash
sim scores: [0.9375, 0.5]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.9375
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.5000
Computing corpus SimHash: 100%|██████████| 6/6 [00:00<00:00, 3128.91it/s]
2023-09-11 03:36:11.675 | INFO | similarities.literalsim:add_corpus:84 - Add 6 docs, total: 6, emb size: 6
sim search: {0: {3: 0.703125, 5: 0.5625, 1: 0.515625}, 1: {0: 0.78125, 1: 0.484375, 2: 0.484375}, 2: {4: 1.0, 1: 0.59375, 2: 0.59375}}
query: 我的花呗开通了?
search top 3:
我什么时候开通了花呗: 0.7031
暴风雨掩埋了东北部;新泽西16英寸的降雪: 0.5625
人在巴基斯坦基地的炸弹袭击中丧生: 0.5156
query: 乌克兰被俄罗斯警告
search top 3:
俄罗斯警告乌克兰反对欧盟协议: 0.7812
人在巴基斯坦基地的炸弹袭击中丧生: 0.4844
中央情报局局长访问以色列叙利亚会谈: 0.4844
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 1.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.5938
中央情报局局长访问以色列叙利亚会谈: 0.5938
--------------------------------------------------
Similarity: TfidfSimilarity, matching_model: Tfidf
2023-09-11 03:36:12.649 | INFO | similarities.literalsim:add_corpus:238 - Start computing corpus embeddings, new docs: 6
sim scores: tensor([[0.7948, 0.4022],
[1.0000, 0.4048]], dtype=torch.float64)
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.7948
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.4048
Computing corpus TFIDF: 100%|██████████| 6/6 [00:00<00:00, 23.17it/s]
2023-09-11 03:36:12.911 | INFO | similarities.literalsim:add_corpus:247 - Add 6 docs, total: 6, emb size: 6
2023-09-11 03:36:13.461 | INFO | similarities.literalsim:add_corpus:334 - Start computing corpus embeddings, new docs: 6
2023-09-11 03:36:13.463 | INFO | similarities.literalsim:add_corpus:340 - Add 6 docs, total: 6
2023-09-11 03:36:13.465 | INFO | text2vec.word2vec:__init__:80 - Load pretrained model:w2v-light-tencent-chinese, path:/home/mw/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin
sim search: {0: {3: 0.921499490737915, 4: 0.43930041790008545, 5: 0.0}, 1: {0: 0.7380481958389282, 4: 0.0, 3: 0.0}, 2: {4: 0.8345502018928528, 3: 0.0, 5: 0.0}}
query: 我的花呗开通了?
search top 3:
我什么时候开通了花呗: 0.9215
花呗更改绑定银行卡: 0.4393
暴风雨掩埋了东北部;新泽西16英寸的降雪: 0.0000
query: 乌克兰被俄罗斯警告
search top 3:
俄罗斯警告乌克兰反对欧盟协议: 0.7380
花呗更改绑定银行卡: 0.0000
我什么时候开通了花呗: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 0.8346
我什么时候开通了花呗: 0.0000
暴风雨掩埋了东北部;新泽西16英寸的降雪: 0.0000
--------------------------------------------------
Similarity: BM25Similarity, matching_model: BM25
sim search: {0: {3: 4.453010263817695, 4: 1.3720219233789517, 5: 1.010258330300517}, 1: {0: 4.245182027356298, 1: 0.0, 2: 0.0}, 2: {4: 4.549213631437518, 0: 0.0, 1: 0.0}}
query: 我的花呗开通了?
search top 3:
我什么时候开通了花呗: 4.4530
花呗更改绑定银行卡: 1.3720
暴风雨掩埋了东北部;新泽西16英寸的降雪: 1.0103
query: 乌克兰被俄罗斯警告
search top 3:
俄罗斯警告乌克兰反对欧盟协议: 4.2452
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
中央情报局局长访问以色列叙利亚会谈: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 4.5492
俄罗斯警告乌克兰反对欧盟协议: 0.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
2023-09-11 03:36:14.743 | INFO | similarities.literalsim:add_corpus:424 - Start computing corpus embeddings, new docs: 6
Similarity: WordEmbeddingSimilarity, matching_model: Word2Vec
sim scores: tensor([[0.9812, 0.8195],
[1.0000, 0.8264]], dtype=torch.float64)
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.9812
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.8264
Word2Vec Embeddings: 100%|██████████| 6/6 [00:00<00:00, 7707.76it/s]
2023-09-11 03:36:14.746 | INFO | similarities.literalsim:add_corpus:431 - Add 6 docs, total: 6, emb size: 6
2023-09-11 03:36:14.756 | INFO | similarities.literalsim:add_corpus:804 - Start add new docs: 6
2023-09-11 03:36:14.757 | INFO | similarities.literalsim:add_corpus:805 - Add 6 docs, total: 6
2023-09-11 03:36:14.761 | INFO | similarities.literalsim:add_corpus:900 - Start add new docs: 6
2023-09-11 03:36:14.762 | INFO | similarities.literalsim:add_corpus:901 - Add 6 docs, total: 6
sim search: {0: {3: 0.8737779259681702, 4: 0.7954849004745483, 5: 0.713451623916626}, 1: {0: 0.9661487936973572, 1: 0.811479926109314, 2: 0.7922273278236389}, 2: {4: 0.9858745336532593, 2: 0.819598913192749, 5: 0.8008757829666138}}
query: 我的花呗开通了?
search top 3:
我什么时候开通了花呗: 0.8738
花呗更改绑定银行卡: 0.7955
暴风雨掩埋了东北部;新泽西16英寸的降雪: 0.7135
query: 乌克兰被俄罗斯警告
search top 3:
俄罗斯警告乌克兰反对欧盟协议: 0.9661
人在巴基斯坦基地的炸弹袭击中丧生: 0.8115
中央情报局局长访问以色列叙利亚会谈: 0.7922
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 0.9859
中央情报局局长访问以色列叙利亚会谈: 0.8196
暴风雨掩埋了东北部;新泽西16英寸的降雪: 0.8009
--------------------------------------------------
Similarity: SameCharsSimilarity, matching_model: SameChars
sim scores: [0.8888888888888888, 0.2222222222222222]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8889
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.2222
sim search: {0: {3: 0.75, 4: 0.25, 5: 0.25}, 1: {0: 0.8888888888888888, 1: 0.1111111111111111, 2: 0.0}, 2: {4: 1.0, 0: 0.0, 1: 0.0}}
query: 我的花呗开通了?
search top 3:
我什么时候开通了花呗: 0.7500
花呗更改绑定银行卡: 0.2500
暴风雨掩埋了东北部;新泽西16英寸的降雪: 0.2500
query: 乌克兰被俄罗斯警告
search top 3:
俄罗斯警告乌克兰反对欧盟协议: 0.8889
人在巴基斯坦基地的炸弹袭击中丧生: 0.1111
中央情报局局长访问以色列叙利亚会谈: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 1.0000
俄罗斯警告乌克兰反对欧盟协议: 0.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
Similarity: SequenceMatcherSimilarity, matching_model: SequenceMatcher
sim scores: [0.5555555555555556, 0.2222222222222222]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.5556
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.2222
sim search: {0: {3: 0.375, 4: 0.25, 1: 0.125}, 1: {0: 0.5555555555555556, 1: 0.1111111111111111, 2: 0.0}, 2: {4: 1.0, 0: 0.0, 1: 0.0}}
query: 我的花呗开通了?
search top 3:
我什么时候开通了花呗: 0.3750
花呗更改绑定银行卡: 0.2500
人在巴基斯坦基地的炸弹袭击中丧生: 0.1250
query: 乌克兰被俄罗斯警告
search top 3:
俄罗斯警告乌克兰反对欧盟协议: 0.5556
人在巴基斯坦基地的炸弹袭击中丧生: 0.1111
中央情报局局长访问以色列叙利亚会谈: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 1.0000
俄罗斯警告乌克兰反对欧盟协议: 0.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
2. 图像相似度计算和匹配搜索
支持CLIP、pHash、SIFT等算法的图像相似度计算和匹配搜索。
自行去huggingface下载模型到和鲸社区里即可:OFA-Sys/chinese-clip-vit-base-patch16 ,这里就不过展示了。
# import glob
# import sys
# from PIL import Image
# sys.path.append('/home/mw/project/similarities-main/similarities')
# from imagesim import ImageHashSimilarity, SiftSimilarity,ClipSimilarity
# def sim_and_search(m):
# print(m)
# # similarity
# sim_scores = m.similarity(imgs1, imgs2)
# print('sim scores: ', sim_scores)
# for (idx, i), j in zip(enumerate(image_fps1), image_fps2):
# s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
# print(f"{i} vs {j}, score: {s:.4f}")
# # search
# m.add_corpus(corpus_imgs)
# queries = imgs1
# res = m.most_similar(queries, topn=3)
# print('sim search: ', res)
# for q_id, c in res.items():
# print('query:', image_fps1[q_id])
# print("search top 3:")
# for corpus_id, s in c.items():
# print(f'\t{m.corpus[corpus_id].filename}: {s:.4f}')
# print('-' * 50 + '\n')
# def clip_demo():
# m = ClipSimilarity()
# print(m)
# # similarity score between text and image
# image_fps = [
# '/home/mw/project/similarities-main/examples/data/image3.png', # yellow flower image
# '/home/mw/project/similarities-main/examples/data/image1.png', # tiger image
# ]
# texts = ['a yellow flower', '老虎', '一头狮子', '玩具车']
# imgs = [Image.open(i) for i in image_fps]
# sim_scores = m.similarity(imgs, texts)
# print('sim scores: ', sim_scores)
# for idx, i in enumerate(image_fps):
# for idy, j in enumerate(texts):
# s = sim_scores[idx][idy]
# print(f"{i} vs {j}, score: {s:.4f}")
# print('-' * 50 + '\n')
# if __name__ == "__main__":
# image_fps1 = ['/home/mw/project/similarities-main/examples/data/image1.png', '/home/mw/project/similarities-main/examples/data/image3.png']
# image_fps2 = ['/home/mw/project/similarities-main/examples/data/image12-like-image1.png', '/home/mw/project/similarities-main/examples/data/image10.png']
# imgs1 = [Image.open(i) for i in image_fps1]
# imgs2 = [Image.open(i) for i in image_fps2]
# corpus_fps = glob.glob('data/*.jpg') + glob.glob('data/*.png')
# corpus_imgs = [Image.open(i) for i in corpus_fps]
# # 1. image and text similarity
# clip_demo()
# # 2. image and image similarity score
# sim_and_search(ClipSimilarity()) # the best result
# sim_and_search(ImageHashSimilarity(hash_function='phash'))
# sim_and_search(SiftSimilarity())
Similarity: ClipSimilarity, matching_model: CLIPModel
sim scores: tensor([[0.9580, 0.8654],
[0.6558, 0.6145]])
data/image1.png vs data/image12-like-image1.png, score: 0.9580
data/image3.png vs data/image10.png, score: 0.6145
sim search: {0: {6: 0.9999999403953552, 0: 0.9579654932022095, 4: 0.9326782822608948}, 1: {8: 0.9999997615814209, 4: 0.6729235649108887, 0: 0.6558331847190857}}
query: data/image1.png
search top 3:
data/image1.png: 1.0000
data/image12-like-image1.png: 0.9580
data/image8-like-image1.png: 0.9327
sim scores: tensor([[0.3220, 0.2409],
[0.1677, 0.2959]])
data/image3.png vs a yellow flower, score: 0.3220
data/image1.png vs 老虎, score: 0.2112
更多优质内容请关注公号:汀丶人工智能;会提供一些相关的资源和优质文章,免费获取阅读。
项目链接
项目链接fork一下即可运行含码源:文本语义匹配搜索快速上手baseline