SimCSE
Paper: Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821.
1. Download the model from the Hugging Face Hub
You can download princeton-nlp/sup-simcse-bert-base-uncased manually from the website,
or download it with code:
import os
from transformers import AutoTokenizer, AutoModel

# Model name and local save path
model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
local_model_path = "./local-simcse-model"

# If the local path does not exist, download the model
if not os.path.exists(local_model_path):
    os.makedirs(local_model_path)
    # Download and save the tokenizer and the model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(local_model_path)
    model = AutoModel.from_pretrained(model_name)
    model.save_pretrained(local_model_path)
Downloading with code kept failing for me with a network error:
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like princeton-nlp/sup-simcse-bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
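If the Hub is unreachable from your network, one possible workaround is to point huggingface_hub at a mirror before anything from transformers is imported, and fetch the whole repository in one call. A minimal sketch, assuming a recent huggingface_hub and that the hf-mirror.com mirror is reachable from your network (both are assumptions, not part of the original setup):

import os
# Must be set before importing transformers / huggingface_hub
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # assumed mirror; replace if needed

from huggingface_hub import snapshot_download

# Download the whole model repository into the local folder in one call
snapshot_download(
    repo_id="princeton-nlp/sup-simcse-bert-base-uncased",
    local_dir="./local-simcse-model",
)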
2. Save the downloaded model to a local folder
I saved it in the folder local-simcse-model.
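Once the files are on disk, you can sanity-check the local copy by loading it with plain transformers and pulling one embedding directly; the SimCSE authors use the pooler output for the supervised checkpoint. A minimal sketch (using pooler_output here follows the SimCSE README; treat it as an assumption if your checkpoint differs):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./local-simcse-model")
model = AutoModel.from_pretrained("./local-simcse-model")

inputs = tokenizer(["This is a sentence."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # [CLS] representation after the pooler: shape (batch, 768)
    embedding = model(**inputs).pooler_output
print(embedding.shape)  # torch.Size([1, 768])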
3. Use the API to generate sentence embeddings
Install sentence-transformers and its dependencies:
pip install transformers
pip install datasets
pip install sentence-transformers
Generate sentence embeddings with the pretrained model:
from sentence_transformers import SentenceTransformer, util

model_name = "princeton-nlp/sup-simcse-bert-base-uncased"  # other pretrained models also work, e.g. unsup-simcse-bert-base-uncased
local_model_path = "./local-simcse-model"

# Load the model with the sentence-transformers library
# model = SentenceTransformer(model_name)
model = SentenceTransformer(local_model_path)  # replace with your local model path

# Example sentences
# sentences = ["This is a sentence.", "This is another sentence."]
sentences = ["NLP算法工程师", "自然语言处理算法工程师", "计算机视觉算法工程师", "大模型算法工程师", "JAVA开发", "平面设计师"]

# Generate sentence embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)  # torch.Size([6, 768])

# Compute pairwise cosine similarities
cosine_similarities = util.pytorch_cos_sim(embeddings, embeddings)
print(cosine_similarities)
tensor([[1.0000, 0.8721, 0.8471, 0.8261, 0.7557, 0.6945],
        [0.8721, 1.0000, 0.9919, 0.9431, 0.7118, 0.7626],
        [0.8471, 0.9919, 1.0000, 0.9512, 0.6979, 0.7743],
        [0.8261, 0.9431, 0.9512, 1.0000, 0.6806, 0.8203],
        [0.7557, 0.7118, 0.6979, 0.6806, 1.0000, 0.6376],
        [0.6945, 0.7626, 0.7743, 0.8203, 0.6376, 1.0000]])
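To go from the raw similarity matrix to a concrete answer, you can rank all sentences against one query. A small sketch reusing the embeddings from above (util.cos_sim is the current name in sentence-transformers; pytorch_cos_sim is an older alias):

import torch

query_idx = 0  # "NLP算法工程师"
scores = util.cos_sim(embeddings[query_idx], embeddings)[0]

# Sort the other sentences by similarity to the query, highest first
ranked = torch.argsort(scores, descending=True)
for i in ranked:
    if i != query_idx:
        print(f"{sentences[i]}\t{scores[i].item():.4f}")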
As the matrix shows,
the similarity between "NLP算法工程师" and "自然语言处理算法工程师" is 0.8721,
between "NLP算法工程师" and "计算机视觉算法工程师" it is 0.8471,
between "NLP算法工程师" and "大模型算法工程师" it is 0.8261,
between "NLP算法工程师" and "JAVA开发" it is 0.7557,
between "NLP算法工程师" and "平面设计师" it is 0.6945,
and so on. The ranking matches intuition: job titles closer in meaning score higher. Note, though, that sup-simcse-bert-base-uncased is trained on English data, so scores on Chinese text like this are best treated as a rough sanity check rather than a calibrated similarity.