[NLP] An Introduction to SentenceTransformers

news2024/12/23 13:58:24

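SentenceTransformers is a Python library for sentence, text, and image embeddings. It can compute text embeddings for more than 100 languages, and those embeddings are easy to use for common tasks such as semantic textual similarity, semantic search, and paraphrase mining.

The framework is built on PyTorch and Transformers and offers a large collection of pretrained models tuned for various tasks. It is also easy to fine-tune your own models.

To learn how the models are trained, read the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

In this article we look at code examples for some of the library's possible use cases. Model training will be covered in a later article.

Installation

Before diving into the code, install the sentence-transformers library with pip.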

sentence-transformers (Sentence Transformers) (huggingface.co)

pip install -U sentence-transformers

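How to use Sentence-Transformers

Using Sentence-Transformers breaks down into two parts:

  1. Running inference with a pretrained model
  2. Training your own BERT model

BERT embeddings typically have several hundred dimensions (768 for BERT-base), which makes downstream computation costly, so a common step is to apply PCA for dimensionality reduction. In practice, however, Sentence-Transformers already provides many models pretrained for specific problems; examples follow in the sections below.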
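As a rough illustration of the PCA step mentioned above, the sketch below reduces a matrix of embeddings with plain NumPy (centering followed by an SVD). The random 768-dimensional matrix is only a stand-in for real model output, and the function name `pca_reduce` and the target size of 128 are assumptions made for this example.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project row vectors onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 768))  # stand-in for model.encode(...) output
reduced = pca_reduce(fake_embeddings, n_components=128)
print(reduced.shape)  # (200, 128)
```

The components should be fitted once on a representative sample and then reused, so that vectors reduced at different times stay comparable.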

Running inference with a pretrained model

Getting embeddings
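A minimal sketch of obtaining embeddings. `SentenceTransformer` and its `encode` method are the library's own API; the model name `all-MiniLM-L6-v2`, the helper name `embed_sentences`, and the sample sentences are assumptions chosen for illustration. The actual call is left commented out because it downloads the model weights on first use.

```python
from typing import List


def embed_sentences(sentences: List[str], model_name: str = "all-MiniLM-L6-v2"):
    """Return one embedding per sentence as a 2-D NumPy array."""
    # Imported lazily so the helper can be defined before the library is needed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the weights on first use
    return model.encode(sentences)


sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# embeddings = embed_sentences(sentences)
# print(embeddings.shape)  # one row per sentence; 384 columns for all-MiniLM-L6-v2
```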

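Semantic textual similarity

Once we have sentence embeddings, we can compute their cosine similarity with the cos_sim function from the util module.

The score lies in [-1, 1]: -1 means completely dissimilar and 1 means completely similar.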
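To make the score range concrete, here is a small sketch of what a cosine-similarity matrix contains, written in plain NumPy on hand-made vectors so that it runs without downloading a model. The helper names and the toy vectors are assumptions; `model_similarity` shows how the same computation would look with `util.cos_sim` on real embeddings and is not executed here.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Entry [i, j] is the cosine similarity between a[i] and b[j]."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def model_similarity(sentences_a, sentences_b, model_name="all-MiniLM-L6-v2"):
    """Same computation on real embeddings, via the library (not run here)."""
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    return util.cos_sim(model.encode(sentences_a), model.encode(sentences_b))


# Same direction, same direction but longer, orthogonal, opposite.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
scores = cosine_similarity_matrix(vectors, vectors)
print(np.round(scores[0], 3))  # similarities of the first vector to all four
```

Vector length does not matter: the first two vectors differ only in magnitude and score 1, while the opposite vector scores -1.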

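Semantic search

Semantic search improves search accuracy by understanding the content of the query instead of relying on lexical matching alone. It does this by comparing embeddings.

All entries in the corpus are embedded into a vector space. At search time the query is embedded into the same space, and the closest corpus embeddings are returned.

(Figure: an example of semantic search in a vector space.)

Semantic search can be run with the semantic_search function from the util module, which takes the embeddings of the corpus documents and the embeddings of the queries.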
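The sketch below mimics the shape of the result: for each query, a list of hits with a `corpus_id` and a `score`, best first. `toy_semantic_search` is a NumPy stand-in written for this example, not the library's implementation, and the two-dimensional vectors are made up. `library_search` shows the corresponding `util.semantic_search` call; the model name is an assumption and the function is not executed here.

```python
import numpy as np


def toy_semantic_search(query_embs: np.ndarray, corpus_embs: np.ndarray, top_k: int = 2):
    """Return, per query, the top_k corpus entries ranked by cosine similarity."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = q @ c.T
    results = []
    for row in scores:
        best = np.argsort(-row)[:top_k]
        results.append([{"corpus_id": int(i), "score": float(row[i])} for i in best])
    return results


def library_search(queries, corpus, model_name="all-MiniLM-L6-v2", top_k=2):
    """The same search with the library on real text (not run here)."""
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    corpus_embs = model.encode(corpus, convert_to_tensor=True)
    query_embs = model.encode(queries, convert_to_tensor=True)
    return util.semantic_search(query_embs, corpus_embs, top_k=top_k)


corpus_embs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query_embs = np.array([[0.9, 0.1]])
hits = toy_semantic_search(query_embs, corpus_embs, top_k=2)
for hit in hits[0]:
    print(hit["corpus_id"], round(hit["score"], 3))
```

In a real application the corpus embeddings are computed once and stored; only the query is encoded at search time.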

 

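To get the most out of semantic search, you must distinguish between symmetric and asymmetric semantic search, because the distinction strongly affects which model you should choose. In symmetric search the query and the corpus entries are of similar length and content; in asymmetric search a short query, such as a question or a few keywords, is matched against longer passages.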

Paraphrase Mining

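Paraphrase mining is the task of finding paraphrases, that is, texts with very similar meaning, in a large collection of sentences.

This can be done with the paraphrase_mining function from the util module.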
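As a sketch of the idea, the code below scores every pair of vectors and returns the best pairs as `[score, i, j]` triplets, which is the form `util.paraphrase_mining` returns. The brute-force all-pairs loop, the helper names, and the toy vectors are assumptions for illustration; the library function is designed to cope with much larger collections. `library_paraphrase_mining` shows the real call and is not executed here.

```python
from itertools import combinations

import numpy as np


def toy_paraphrase_mining(embeddings: np.ndarray, top_k: int = 3):
    """Score all index pairs (i < j) by cosine similarity, highest first."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit.T
    pairs = [[float(scores[i, j]), i, j] for i, j in combinations(range(len(unit)), 2)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)[:top_k]


def library_paraphrase_mining(sentences, model_name="all-MiniLM-L6-v2"):
    """The library call on real sentences (not run here)."""
    from sentence_transformers import SentenceTransformer, util

    return util.paraphrase_mining(SentenceTransformer(model_name), sentences)


# Vectors 0 and 1 point the same way, as do 2 and 3.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]])
top_pairs = toy_paraphrase_mining(embeddings, top_k=2)
for score, i, j in top_pairs:
    print(round(score, 3), i, j)
```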

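Image search

Image search means checking, given a text description, how well it matches the content of an image.

The 'clip-ViT-B-32' model can encode both images and text, and the two resulting embeddings can be compared directly to obtain the similarity between the description and the image.

SentenceTransformers provides models that embed images and text into the same vector space. With such a model you can find similar images and implement image search, that is, search for images with text and vice versa.

(Figure: text and images embedded in the same vector space.)

To run image search, load a model such as CLIP and use its encode method to encode both the images and the text.

Embeddings from multimodal models also support tasks such as image similarity.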
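A minimal sketch of comparing one image with several captions using the 'clip-ViT-B-32' model named above. The helper name `score_captions`, the captions, and the file name in the commented-out call are hypothetical; the call is left commented out because it needs a local image and downloads the model weights.

```python
from typing import List


def score_captions(image_path: str, captions: List[str], model_name: str = "clip-ViT-B-32"):
    """Cosine similarity between one image and each caption."""
    # Imported lazily: Pillow and the model are only needed when this runs.
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    image_emb = model.encode(Image.open(image_path))  # images are encoded like text
    text_emb = model.encode(captions)
    return util.cos_sim(image_emb, text_emb)


captions = [
    "Two dogs in the snow",
    "A cat on a table",
    "A picture of London at night",
]

# scores = score_captions("two_dogs_in_snow.jpg", captions)  # hypothetical local file
# print(scores)  # the highest entry marks the caption closest to the image
```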

Extractive Summarization

Given an article, the model picks a few of its sentences to serve as the summary. The official example can be used directly:

"""
This example uses LexRank (https://www.aaai.org/Papers/JAIR/Vol22/JAIR-2214.pdf)
to create an extractive summarization of a long document.
The document is split into sentences using NLTK, then the sentence embeddings are computed. We
then compute the cosine-similarity across all possible sentence pairs.
We then use LexRank to find the most central sentences in the document, which form our summary.
Input document: First section of the English Wikipedia article on New York City
Output summary:
Located at the southern tip of the U.S. state of New York, the city is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass.
New York City (NYC), often called simply New York, is the most populous city in the United States.
Anchored by Wall Street in the Financial District of Lower Manhattan, New York City has been called both the world's leading financial center and the most financially powerful city in the world, and is home to the world's two largest stock exchanges by total market capitalization, the New York Stock Exchange and NASDAQ.
New York City has been described as the cultural, financial, and media capital of the world, significantly influencing commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports.
If the New York metropolitan area were a sovereign state, it would have the eighth-largest economy in the world.
"""
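import nltk
import numpy as np
from sentence_transformers import SentenceTransformer, util
# LexRank.py is a helper module that accompanies the official example; it must be on the import path
from LexRank import degree_centrality_scores

# sent_tokenize needs the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt', quiet=True)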



model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Our input document we want to summarize
# As example, we take the first section from Wikipedia
document = """
New York City (NYC), often called simply New York, is the most populous city in the United States. With an estimated 2019 population of 8,336,817 distributed over about 302.6 square miles (784 km2), New York City is also the most densely populated major city in the United States. Located at the southern tip of the U.S. state of New York, the city is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass. With almost 20 million people in its metropolitan statistical area and approximately 23 million in its combined statistical area, it is one of the world's most populous megacities. New York City has been described as the cultural, financial, and media capital of the world, significantly influencing commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports. Home to the headquarters of the United Nations, New York is an important center for international diplomacy.
Situated on one of the world's largest natural harbors, New York City is composed of five boroughs, each of which is a county of the State of New York. The five boroughs—Brooklyn, Queens, Manhattan, the Bronx, and Staten Island—were consolidated into a single city in 1898. The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York is home to more than 3.2 million residents born outside the United States, the largest foreign-born population of any city in the world as of 2016. As of 2019, the New York metropolitan area is estimated to produce a gross metropolitan product (GMP) of $2.0 trillion. If the New York metropolitan area were a sovereign state, it would have the eighth-largest economy in the world. New York is home to the highest number of billionaires of any city in the world.
New York City traces its origins to a trading post founded by colonists from the Dutch Republic in 1624 on Lower Manhattan; the post was named New Amsterdam in 1626. The city and its surroundings came under English control in 1664 and were renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city was regained by the Dutch in July 1673 and was subsequently renamed New Orange for one year and three months; the city has been continuously named New York since November 1674. New York City was the capital of the United States from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York has emerged as a global node of creativity, entrepreneurship, and environmental sustainability, and as a symbol of freedom and cultural diversity. In 2019, New York was voted the greatest city in the world per a survey of over 30,000 people from 48 cities worldwide, citing its cultural diversity.
Many districts and landmarks in New York City are well known, including three of the world's ten most visited tourist attractions in 2013. A record 62.8 million tourists visited New York City in 2017. Times Square is the brightly illuminated hub of the Broadway Theater District, one of the world's busiest pedestrian intersections, and a major center of the world's entertainment industry. Many of the city's landmarks, skyscrapers, and parks are known around the world. Manhattan's real estate market is among the most expensive in the world. Providing continuous 24/7 service and contributing to the nickname The City that Never Sleeps, the New York City Subway is the largest single-operator rapid transit system worldwide, with 472 rail stations. The city has over 120 colleges and universities, including Columbia University, New York University, Rockefeller University, and the City University of New York system, which is the largest urban public university system in the United States. Anchored by Wall Street in the Financial District of Lower Manhattan, New York City has been called both the world's leading financial center and the most financially powerful city in the world, and is home to the world's two largest stock exchanges by total market capitalization, the New York Stock Exchange and NASDAQ.
"""

# Split the document into sentences
sentences = nltk.sent_tokenize(document)
print("Num sentences:", len(sentences))

# Compute the sentence embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute the pair-wise cosine similarities
# (.cpu() is needed before .numpy() when the embeddings live on a GPU)
cos_scores = util.cos_sim(embeddings, embeddings).cpu().numpy()

#Compute the centrality for each sentence
centrality_scores = degree_centrality_scores(cos_scores, threshold=None)

#We argsort so that the first element is the sentence with the highest score
most_central_sentence_indices = np.argsort(-centrality_scores)


#Print the 5 sentences with the highest scores
print("\n\nSummary:")
for idx in most_central_sentence_indices[0:5]:
    print(sentences[idx].strip())
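The script above imports `degree_centrality_scores` from a helper module that is not shown here. As a rough idea of what a LexRank-style score measures, the sketch below row-normalizes the similarity matrix into transition probabilities and iterates towards the stationary distribution, so sentences that are similar to many others end up with high scores. This is a simplified stand-in written for this article, not the helper's actual code; clipping negative similarities to zero, the fixed iteration count, and the toy matrix are all assumptions.

```python
import numpy as np


def lexrank_scores(similarity: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """Stationary distribution of a random walk over the similarity graph."""
    weights = np.clip(similarity, 0.0, None)  # treat negative similarity as "no link"
    transition = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    scores = np.full(len(weights), 1.0 / len(weights))  # start from a uniform guess
    for _ in range(n_iter):
        scores = scores @ transition
    return scores


# Sentence 0 is similar to both others; sentences 1 and 2 barely relate to each other.
similarity = np.array([
    [1.0, 0.9, 0.9],
    [0.9, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])
scores = lexrank_scores(similarity)
ranking = np.argsort(-scores)
print(ranking[0])  # 0, the most central sentence
```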

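Using Sentence-Transformers with Chinese

To process Chinese with Sentence-Transformers, you need to load one of its multilingual models; the official documentation lists them.

A common choice is paraphrase-xlm-r-multilingual-v1, although at about 1 GB it is fairly heavy.

model = SentenceTransformer('paraphrase-xlm-r-multilingual-v1')
Others, such as distiluse-base-multilingual-cased-v2, also perform very well.

Take particular care with non-English text: even if the model you are using does not support your language, it will still produce embeddings, just very poor ones.

It is therefore a good idea to test the embeddings with very simple word pairs to confirm that they behave sensibly. (This step is very important; for a specific project I build up progressively more complex pairs to test with.)

An example is given below.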
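One possible shape for such a word-pair check is sketched here: every pair that should be related must score higher than every pair that should not. The helper names, the word pairs, and the pass criterion are assumptions made for this example. `make_model_scorer` builds a scorer from the multilingual model named above and is not executed here; a trivial character-overlap scorer stands in for it so the snippet runs offline.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def sanity_check(score: Callable[[str, str], float],
                 related: List[Pair], unrelated: List[Pair]) -> bool:
    """True if every related pair scores higher than every unrelated pair."""
    lowest_related = min(score(a, b) for a, b in related)
    highest_unrelated = max(score(a, b) for a, b in unrelated)
    return lowest_related > highest_unrelated


def make_model_scorer(model_name: str = "paraphrase-xlm-r-multilingual-v1"):
    """Build a scorer backed by a real model (not run here)."""
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)

    def score(a: str, b: str) -> float:
        emb = model.encode([a, b], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    return score


def char_overlap(a: str, b: str) -> float:
    """Toy scorer: share of characters the two strings have in common."""
    chars_a, chars_b = set(a), set(b)
    return len(chars_a & chars_b) / len(chars_a | chars_b)


related = [("猫", "小猫"), ("汽车", "轿车")]      # cat/kitten, car/sedan
unrelated = [("猫", "汽车"), ("小猫", "轿车")]    # cat/car, kitten/sedan

print(sanity_check(char_overlap, related, unrelated))  # True
# print(sanity_check(make_model_scorer(), related, unrelated))
```

If the check fails with a real model, that is a strong hint the model does not handle the language well.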

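Other tasks

1. For complex search tasks such as question-answering retrieval, semantic search can be improved significantly with Retrieve & Re-Rank.

(Figure: the Retrieve & Re-Rank architecture.)

2. SentenceTransformers can be used in different ways to cluster small or large sets of sentences.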
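The Retrieve & Re-Rank idea from item 1 can be sketched as a two-stage function: a cheap scorer selects a few candidates from the whole corpus, then a more expensive scorer reorders only those candidates. The function names, the toy scorers, and the sample corpus are assumptions made for this example. `make_cross_encoder_scorer` shows how a `CrossEncoder` could serve as the second stage; the model name is an assumption and the function is not executed here.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_and_rerank(query: str, corpus: List[str],
                        retrieve_score: Scorer, rerank_score: Scorer,
                        top_k: int = 2) -> List[str]:
    """Stage 1 keeps the top_k candidates; stage 2 reorders only those."""
    candidates = sorted(corpus, key=lambda doc: retrieve_score(query, doc), reverse=True)[:top_k]
    return sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)


def make_cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> Scorer:
    """Second-stage scorer backed by a CrossEncoder (not run here)."""
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return lambda query, doc: float(model.predict([(query, doc)])[0])


def word_overlap(query: str, doc: str) -> float:
    """Toy first stage: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


def jaccard(query: str, doc: str) -> float:
    """Toy second stage: shared words relative to all words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


corpus = [
    "a python is a large snake that is not used for programming",
    "python programming tutorial",
    "java is a language",
    "the weather is nice",
]
ranked = retrieve_and_rerank("python programming", corpus, word_overlap, jaccard, top_k=2)
print(ranked)
```

The first stage lets both Python documents through with the same score; the second stage moves the tutorial ahead of the snake sentence.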

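model.encode parameters

sentences – the sentences to embed

batch_size – the batch size used for the computation

show_progress_bar – show a progress bar while encoding

output_value – defaults to sentence_embedding, which returns sentence embeddings. Set it to token_embeddings to get word-piece token embeddings, or to None to get all output values

convert_to_numpy – if true, the output is a list of numpy vectors; otherwise it is a list of PyTorch tensors

convert_to_tensor – if true, a single stacked tensor is returned. This overrides any setting of convert_to_numpy

normalize_embeddings – if true, the returned vectors have length 1. In that case the faster dot product (util.dot_score) can be used instead of cosine similarity.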
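The claim about normalize_embeddings can be checked numerically: once vectors have length 1, their dot product equals their cosine similarity. The sketch below verifies this on random vectors that stand in for embeddings. `encode_normalized` shows how several of the parameters above are passed to `encode`; its name, the model name, and the chosen values are assumptions, and it is not executed here.

```python
import numpy as np


def encode_normalized(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode with unit-length output vectors (not run here)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(
        sentences,
        batch_size=32,
        show_progress_bar=False,
        normalize_embeddings=True,
    )


rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 8))  # stand-in for unnormalized embeddings

# Cosine similarity computed the long way, dividing by both vector lengths.
norms = np.linalg.norm(raw, axis=1)
cosine_scores = (raw @ raw.T) / np.outer(norms, norms)

# After normalizing once, a plain dot product gives the same numbers.
unit = raw / norms[:, None]
dot_scores = unit @ unit.T

print(np.allclose(dot_scores, cosine_scores))  # True
```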

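Points to note:

1. Pass multiple sentences as a list.

2. Phrases and long sentences are also accepted, but input beyond the maximum sequence length is truncated. The limit depends on the model: BERT-based models accept at most 512 word pieces, roughly 300-400 English words, and many pretrained models are configured with a lower limit.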
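Because text past the limit is silently dropped, one common workaround is to split long documents into overlapping chunks and encode each chunk separately. The sketch below does this by word count, which is only a rough proxy for word pieces; the function name and the default sizes are assumptions for illustration. A loaded model reports its own limit through its `max_seq_length` attribute.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


long_text = " ".join(f"w{i}" for i in range(500))  # a 500-word dummy document
chunks = split_into_chunks(long_text)
print(len(chunks))  # 3
```

Each chunk can then be passed to `encode`, and the chunk embeddings searched individually or averaged, depending on the task.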

 

Example: topic modeling on documents

Finally, the official SentenceTransformers website: https://www.sbert.net/

Author: Fabio Chiusano

無痛使用超強NLP model — BERT. Sentence Transformers 使用方法介紹 | by 倢愷 Oscar | Medium

5分钟 NLP系列 — SentenceTransformers 库介绍 - 知乎 (zhihu.com)

文本相似度的BERT度量方法 - 知乎 (zhihu.com)

训练一个SentenceTransformer模型 | Lowin Li
