simCSE句子向量表示(1)-使用transformers API

news2025/4/5 2:10:57

SimCSE
SimCSE: Simple Contrastive Learning of Sentence Embeddings.
Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821.

1、huggingface官网下载模型

官网手动下载：princeton-nlp/sup-simcse-bert-base-uncased
在这里插入图片描述
也可以使用代码下载

import os
from transformers import AutoTokenizer, AutoModel

# 模型名称和本地路径
model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
local_model_path = "./local-simcse-model"

# 如果本地路径不存在，则下载模型
if not os.path.exists(local_model_path):
    os.makedirs(local_model_path)
    # 下载并保存分词器和模型
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(local_model_path)

    model = AutoModel.from_pretrained(model_name)
    model.save_pretrained(local_model_path)

使用代码下载，我这边一直报错，提醒网络不好
OSError: We couldn’t connect to ‘https://huggingface.co’ to load this file, couldn’t find it in the cached files and it looks like princeton-nlp/sup-simcse-bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at ‘https://huggingface.co/docs/transformers/installation#offline-mode’.

2、模型下载后保存到本地文件夹

我保存在文件夹:local-simcse-model
在这里插入图片描述

3、使用api生成句子向量

安装sentence_transformers

pip install transformers
pip install datasets
pip install sentence-transformers

使用预训练模型生成句子向量

from sentence_transformers import SentenceTransformer, util

model_name = "princeton-nlp/sup-simcse-bert-base-uncased"  # 也可以使用其他预训练模型，如 unsup-simcse-bert-base-uncased
local_model_path = "./local-simcse-model"
# 使用sentence-transformers库加载模型
# model = SentenceTransformer(model_name)
model = SentenceTransformer(local_model_path) # 换成本地模型存放路径

# 示例句子
# sentences = ["This is a sentence.", "This is another sentence."]
sentences = ["NLP算法工程师", "自然语言处理算法工程师", "计算机视觉算法工程师", "大模型算法工程师", "JAVA开发", "平面设计师"]

# 生成句子嵌入
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape) # torch.Size([6, 768])

# 计算句子之间的余弦相似性
cosine_similarities = util.pytorch_cos_sim(embeddings, embeddings)
print(cosine_similarities)

tensor([[1.0000, 0.8721, 0.8471, 0.8261, 0.7557, 0.6945],
[0.8721, 1.0000, 0.9919, 0.9431, 0.7118, 0.7626],
[0.8471, 0.9919, 1.0000, 0.9512, 0.6979, 0.7743],
[0.8261, 0.9431, 0.9512, 1.0000, 0.6806, 0.8203],
[0.7557, 0.7118, 0.6979, 0.6806, 1.0000, 0.6376],
[0.6945, 0.7626, 0.7743, 0.8203, 0.6376, 1.0000]])
可见，
"NLP算法工程师"和"自然语言处理算法工程师"之间的相似度是0.8721，
"NLP算法工程师"和"计算机视觉算法工程师"之间的相似度是0.8471，
"NLP算法工程师"和"大模型算法工程师"之间的相似度是0.8261，
"NLP算法工程师"和"JAVA开发"之间的相似度是 0.7557，
"NLP算法工程师"和"平面设计师"之间的相似度是0.6945，
……