SentencePiece进行文本分类

news2025/1/19 11:33:59

SentencePieces

前言

Step1:故事

  • SentencePiece 是一个无监督的文本分词器和 detokenizer(还原回去的?)
  • 主要用于词汇表大小是预定的文本生成系统中
  • 它拓展了原始句子的训练,实现子词单元如 BPE 和 unigram language model
  • 技术亮点
    • 纯数据驱动,纯从句子中训练 tokenizer 和 detokenizer。不总是需要预训练模型
    • 语言独立:把句子视为 Unicode 字符,没有语言逻辑
    • 多个子词算法: BPE 和 Unigram LM
    • 子词正则化:实现了 子词正则化 和 BPE dropout 的子词采样,有助于提高鲁棒性和准确性。
    • 快、轻量级:每秒 50k 个句子,内存大概 6MB
    • 自包含:相同的模型文件相同的 tokenizer
    • 快速词汇 id 生成
    • NFKC 的正则化
      • NFC : 组合形式,字符被标准化为单个预组合字符(合成字符)
      • NFD : 分解模型,字符被标准化为基本字符加上组合符号的形式(分解模式)—— 原始字符:é —> NFD 形式:e + ´
      • NFKC : 兼容性组合模式,类似 NFC,但在标准化过程中可能会删除某些格式化信息
      • NFKD : 兼容性分解模式,类似 NFD,但在标准化过程中可能会删除某些格式化信息
  • **吐槽:**这些 HF 的 tokenizers 都能做。。。。。。而且 Tokenizers 做的更多

1.什么是 SentencePiece

  • 是子词单元的重新实现,缓解开放词汇表问题
  • 独一无二的 token 数量是预定的,例如 8k, 16k, 32k
  • 用未处理的句子训练
    • 以前的子词实现为了告诉训练,需要提前将输入句子 token 化。
    • SentencePiece 实现很快,可以使用原始句子训练模型。这对于中文或日语很有用
  • 空格被视为基本符号
    • 原来,(word.) == (word .)
    • 现在,(word.) != (word_.)
    • 因为空格被保存到了句子中,所以可以不含糊的 detokenize 回去;对比原来是不可你转的
    • 这让 tokenization 没有语言依赖成为了可能

2.子词正则化 和 BPE Dropout

  • 目的:用于子词分割和模型训练,旨在提高模型的泛化能力和鲁棒性
  • 子词正则化:
    • 远离:在训练时不会固定使用一种分割方法,而是从多种分割方案中,随机选择一种。增强模型应对多样性输入的能力
    • 优点:引入分词的不确定性,提高鲁棒性和泛海能力。对低资源等数据较少的场景友好
  • BPE Dropout
    • 原理:常规 BPE 中,每次选频率最高的字符进行合并,而 BPE Dropout 会随机丢弃一些合并步骤。意味着在训练中,同一个词语在不同的迭代中可能被分割成不同的子词序列。
    • 优点:引入随机性,鲁棒性,饭还行。对 OOV 问题友好

3.安装

  • pip 安装
pip install sentencepiece
  • c++ 源码安装
git clone https://github.com/google/sentencepiece.git 
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo update_dyld_shared_cache
# sudo ldconfig -v --> ubuntu

Step2:使用指南

1.训练 SentencePiece 模型

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
--input: 每行一句的语料库文件。默认使用 NFKC。可以传递逗号分隔的文教列表。
--model_prefix: 输出模型名字前缀。生成 xx.model 和 xx.vocab
--vocab_size: 词汇表大小,如 8000, 16000, 32000
--character_coverage: 模型涵盖的字符数量,好的默认是 0.9995(中文或日语等丰富的字符集),小字符集可以是 1.0
--model_type: 模型类型,选择 unigram(默认), bpe, char, word

剩下的...
--input (comma separated list of input sentences)  type: std::string default: ""
--input_format (Input format. Supported format is `text` or `tsv`.)  type: std::string default: ""
--model_prefix (output model prefix)  type: std::string default: ""
--model_type (model algorithm: unigram, bpe, word or char)  type: std::string default: "unigram"
--vocab_size (vocabulary size)  type: int32 default: 8000
--accept_language (comma-separated list of languages this model can accept)  type: std::string default: ""
--self_test_sample_size (the size of self test samples)  type: int32 default: 0
--character_coverage (character coverage to determine the minimum symbols)  type: double default: 0.9995
--input_sentence_size (maximum size of sentences the trainer loads)  type: std::uint64_t default: 0
--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0)  type: bool default: true
--seed_sentencepiece_size (the size of seed sentencepieces)  type: int32 default: 1000000
--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss)  type: double default: 0.75
--num_threads (number of threads for training)  type: int32 default: 16
--num_sub_iterations (number of EM sub-iterations)  type: int32 default: 2
--max_sentencepiece_length (maximum length of sentence piece)  type: int32 default: 16
--max_sentence_length (maximum length of sentence in byte)  type: int32 default: 4192
--split_by_unicode_script (use Unicode script to split sentence pieces)  type: bool default: true
--split_by_number (split tokens by numbers (0-9))  type: bool default: true
--split_by_whitespace (use a white space to split sentence pieces)  type: bool default: true
--split_digits (split all digits (0-9) into separate pieces)  type: bool default: false
--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.)  type: bool default: false
--allow_whitespace_only_pieces (allow pieces that only contain (consecutive) whitespace tokens)  type: bool default: false
--control_symbols (comma separated list of control symbols)  type: std::string default: ""
--control_symbols_file (load control_symbols from file.)  type: std::string default: ""
--user_defined_symbols (comma separated list of user defined symbols)  type: std::string default: ""
--user_defined_symbols_file (load user_defined_symbols from file.)  type: std::string default: ""
--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage)  type: std::string default: ""
--required_chars_file (load required_chars from file.)  type: std::string default: ""
--byte_fallback (decompose unknown pieces into UTF-8 byte pieces)  type: bool default: false
--vocabulary_output_piece_score (Define score in vocab file)  type: bool default: true
--normalization_rule_name (Normalization rule name. Choose from nfkc or identity)  type: std::string default: "nmt_nfkc"
--normalization_rule_tsv (Normalization rule TSV file. )  type: std::string default: ""
--denormalization_rule_tsv (Denormalization rule TSV file.)  type: std::string default: ""
--add_dummy_prefix (Add dummy whitespace at the beginning of text)  type: bool default: true
--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace)  type: bool default: true
--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.)  type: bool default: true
--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.)  type: bool default: false
--unk_id (Override UNK (<unk>) id.)  type: int32 default: 0
--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.)  type: int32 default: 1
--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.)  type: int32 default: 2
--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.)  type: int32 default: -1
--unk_piece (Override UNK (<unk>) piece.)  type: std::string default: "<unk>"
--bos_piece (Override BOS (<s>) piece.)  type: std::string default: "<s>"
--eos_piece (Override EOS (</s>) piece.)  type: std::string default: "</s>"
--pad_piece (Override PAD (<pad>) piece.)  type: std::string default: "<pad>"
--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.)  type: std::string default: " ⁇ "
--train_extremely_large_corpus (Increase bit depth for unigram tokenization.)  type: bool default: false
--random_seed (Seed value for random generator.)  type: uint32 default: 4294967295
--enable_differential_privacy (Whether to add DP while training. Currently supported only by UNIGRAM model.)  type: bool default: false
--differential_privacy_noise_level (Amount of noise to add for DP)  type: float default: 0
--differential_privacy_clipping_threshold (Threshold for clipping the counts for DP)  type: std::uint64_t default: 0
--help (show help)  type: bool default: false
--version (show version)  type: bool default: false
--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere)  type: int default: 0

2.编码未处理的文本到 sentence pieces/ids

spm_encode --model=<model_file> --output_format=piece < input > output
spm_encode --model=<model_file> --output_format=id < input > output
  • 使用 --extra_options 去添加 BOS/EOS 或 反向输入序列
spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

3.解码 sentence pieces/ids

spm_decode --model=<model_file> --input_format=piece < input > output
spm_decode --model=<model_file> --input_format=id < input > output
  • 使用 --extra_options 在反向顺序中解码文本
spm_decode --extra_options=reverse < input > output

4.端到端的例子

spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
'''
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
'''

echo "I saw a girl with a telescope." | spm_encode --model=m.model
'''
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
'''

echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
'''
9 459 11 939 44 11 4 142 82 8 28 21 132 6
'''

echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
'''
I saw a girl with a telescope.
'''

5.导出词汇列表

spm_export_vocab --model=<model_file> --output=<output file>

6.例子

notebook

Step3:实验

1.数据集

酒店评论数据集,处理成每行一句的形式

在这里插入图片描述

2.使用

  1. 训练(我弄的是12800 词汇表大小)

    在这里插入图片描述

生成了两个文件,一个是模型文件,一个是词表文件
在这里插入图片描述
词表文件如下在这里插入图片描述

  1. 分词

    • 直接分词就好了,因为任务是分类,不需要插入 eos 和 bos

    在这里插入图片描述

    在这里插入图片描述

    • 分成 id
      • note : 生成的词汇表的顺序正好是对应的 词 id 的自增顺序

    在这里插入图片描述

    在这里插入图片描述

  2. 并没有对应的词向量文件,看来还需要对这些词进行词嵌入训练,还是用fasttext好了。

    • 写个脚本变成 fasttext 需要的形式

    在这里插入图片描述

  3. id 和 词向量都有了,可以构造词嵌入矩阵了

    • 对应关系是

      • fast_vec —> word : vec
      • spm_vec —> id : word
      • 构造 embedding —> id:vec
      • 1.2w 数据中有 20 行为 空,不多,对空值处理为随机值吧

      在这里插入图片描述

      • 写个脚本,然后保存为词向量的 .npy 文件,留着模型用

      在这里插入图片描述

    • 思想

      • 用 sentencepiece 作为分词器,得到一系列 id
      • 把 id 为给模型
      • 模型训练
      • 推理的时候也是 sentencepiece 分词
    • 实践开始吧~

      • 代码
        上方资源处自取

      • 效果:基本收敛到了 96%

        在这里插入图片描述

      • 30之后连同嵌入层一起微调10轮,准确率又上去了一个百分点

        在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/2165350.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Qemu开发ARM篇-6、emmc/SD卡AB分区镜像制作

文章目录 1、AB分区镜像制作2、uboot修改3、镜像启动 在上一篇 Qemu开发ARM篇-5、buildroot制作根文件系统并挂载启动中&#xff0c;我们通过buildroot制作了根文件系统&#xff0c;并通过 SD卡的形式将其挂载到设备并成功进行了启动&#xff0c;但上一章中&#xff0c;我们的…

车载应用的多功能需求与公安、金融等行业的应用特点

随着科技的快速发展&#xff0c;车载应用的功能需求也日益多样化。除了基本的视频监控功能外&#xff0c;现代车载应用还需满足一系列高级功能&#xff0c;如无线网络视频监控、GPS卫星定位、车辆调度、语音报站、行驶信息记录以及多媒体娱乐广告播放等。这些功能在公安、金融等…

2024年数字化转型与管理国际学术会议(DTM 2024)

目录 重要信息 大会简介 大会组委 征稿主题 论文出版 会议议程 参会方式 重要信息 大会官网&#xff1a;www.icemme.org&#xff08;点击了解大会&#xff0c;投稿等详细信息&#xff09; 大会时间&#xff1a;2024年11月22-24日 大会地点&#xff1a;中国-大连 大会…

三维重建的几何评价指标

1.三维重建的几何评价指标 1.1 Chamfer Distance Geometry quality (1) Chamfer Distance&#xff08;CD&#xff09; CD衡量两组点云之间的几何差异&#xff0c;距离越小越好。 CD是一种用于衡量两个点云之间相似度的常用几何评价指标。它计算一个点云中每个点到另一个点云的…

Qt5.15和Qt6.7配置Android开发环境

最近重新安装了Qt5.15.2和Qt6.7.2,使用Qt Creator14.0.1,配置Android开发环境时又碰到了一些问题,记录如下。 1、Qt6.7.2使用AndroidStudio的JDK 因为系统原来安装了AndroidStudio2024,系统自动检测了JDK位置,点击设置SDK,可以自动安装好相应的NDK。 打开Qt Creator14…

JavaEE——多线程的状态及线程安全问题

目录 一、线程的状态 1、NEW 2、 TERMINATED 3、RUNNABLE 4、TIMED_WAITING 5、 BLOCKED 6、WAITING 二、线程安全问题 1、线程不安全的原因 2、一个线程不安全的实例 3、加锁操作 4、产生线程不安全的原因 什么是内存可见性呢&#xff1f; 解决方案&#xff1f; 5、指令重排序…

【Linux学习】1-2 新建虚拟机ubuntu环境

1.双击打开VMware软件&#xff0c;点击“创建新的虚拟机”&#xff0c;在弹出的中选择“自定义&#xff08;高级&#xff09;” 2.点击下一步&#xff0c;自动识别ubuntu光盘映像文件&#xff0c;也可以点击“浏览”手动选择&#xff0c;点击下一步 3.设置名称及密码后&#xf…

web - RequestResponse

##Request&Response 1&#xff0c;Request和Response的概述 Request是请求对象&#xff0c;Response是响应对象。这两个对象在我们使用Servlet的时候有看到&#xff1a; 此时&#xff0c;我们就需要思考一个问题request和response这两个参数的作用是什么? request:获取请…

基于微信小程序的竞赛答题小程序开发笔记(一)

开发背景调研 中小学学科答题小程序&#xff0c;适合各中小学校方&#xff0c;老师或者家长。通过互动和参与式学习&#xff0c;小程序能够通过游戏化元素提升学习的积极性和参与度&#xff0c;从而提升学习效率&#xff0c;促进学生自主学习 功能规划 分类题库&#xff1a;…

专题八_链表_算法专题详细总结

目录 链表 1.常用技巧 1&#xff09;画图&#xff01;&#xff01;&#xff01; -> 直观 形象 便于我们理解 2&#xff09;引入虚拟“头”节点 1.便于处理边界条件 2.方便我们对链表进行操作 3.不要吝啬空间&#xff0c;大胆定义变量 4.快慢双指针 1.判断链表是否…

redis学习(014 实战:黑马点评:优惠券秒杀——1人只可以下1单问题解决方案)

黑马程序员Redis入门到实战教程&#xff0c;深度透析redis底层原理redis分布式锁企业解决方案黑马点评实战项目 总时长 42:48:00 共175P 此文章包含第54p-第p55的内容 文章目录 一人一单问题分析第一种写法 查询后进行添加第二种写法 加悲观锁在用户上加悲观锁&#xff08;提…

Vue 响应式监听 Watch 最佳实践

一. 前言 上一篇文章我们学习了 watch 的基础知识&#xff0c;了解了它的基本使用方法及注意事项&#xff0c;本篇文章我们继续了解在Vue 中 响应式监听 watch 的妙用。了解 watch 的基础使用请参考上一篇文章&#xff1a; 详解 Vue 中 Watch 的使用方法及注意事项https://bl…

53 语言模型(和之后用来训练语言模型的数据集)_by《李沐:动手学深度学习v2》pytorch版

系列文章目录 文章目录 系列文章目录理论部分使用计数来建模N元语法总结 代码读取长序列数据随机采样顺序分区 小结练习 理论部分 在上一部分中&#xff0c;我们了解了如何将文本数据映射为词元&#xff0c;以及将这些词元可以视为一系列离散的观测&#xff0c;例如单词或字符…

(务必收藏)推荐市面上8款AI自动写文献综述的网站

在当前的学术研究和论文写作中&#xff0c;AI技术的应用已经变得越来越普遍。特别是在文献综述这一环节&#xff0c;AI工具能够显著提高效率并减少人工劳动。以下是市面上8款推荐的AI自动写文献综述的网站&#xff1a; 一、千笔-AIPassPaper 是一款备受好评的AI论文写作平台&…

java 框架组件

Java 框架是一系列预先编写好的、可复用的软件组件&#xff0c;它们旨在帮助开发者快速构建高质量的应用程序。Java 社区拥有众多优秀的框架&#xff0c;涵盖了从 Web 开发到大数据处理的各个领域。下面是一些流行的 Java 框架及其主要用途&#xff1a; Spring框架&#xff1a;…

基于丹摩智算部署SD3+ComfyUI文生图详解

目录 丹摩智算简介SD3ComfyUI文生图简介 SD3ComfyUI文生图部署步骤1.1、实例创建 操作步骤从HF-mirror下载SD3模型安装git安装ComfyUI 丹摩智算简介 丹摩智算官网&#xff1a;https://www.damodel.com/home 丹摩智算&#xff08;DAMODEL&#xff09;是一款专为AI应用打造的智…

网红挣钱太容易了

你看最近这个三只羊小Y哥&#xff0c;因为月饼质量问题、因为大闸蟹的问题&#xff0c;上了好多次热搜&#xff0c;掉粉了几百万。还是有很多人在赶着要买他们家的东西。 你是他的粉丝&#xff0c;他是你的屠夫。只要冠以“全网最低价”的名号&#xff0c;就会有无数的粉丝跑过…

应用层协议 --- HTTP

序言 在上一篇文章中&#xff0c;我们在应用层实现了一个非常简单的自定义协议&#xff0c;我们在我们报文的首部添加了报文的长度并且使用特定的符号分割。但是想做一个成熟&#xff0c;完善的协议是不简单的&#xff0c;今天我们就一起看看我们每天都会用到的 HTTP协议 。 UR…

华语童声璀璨新星陈千言、陈万语闪耀荣登2024年度最受媒体欢迎女歌手

华语童声璀璨新星陈千言、陈万语闪耀荣登2024年度最受媒体欢迎女歌手 近日&#xff0c;华语乐坛传来一则令人振奋的消息&#xff0c;11 岁的双胞胎姐妹花陈千言和陈万语荣获 2024 华语童声最受媒体欢迎女歌手和第 15 届华语金曲奖优秀童星两项大奖&#xff01; 陈千言和陈万语…

前缀和(2)_【模板】二维前缀和_模板

个人主页&#xff1a;C忠实粉丝 欢迎 点赞&#x1f44d; 收藏✨ 留言✉ 加关注&#x1f493;本文由 C忠实粉丝 原创 前缀和(2)_【模板】二维前缀和_模板 收录于专栏【经典算法练习】 本专栏旨在分享学习算法的一点学习笔记&#xff0c;欢迎大家在评论区交流讨论&#x1f48c; 目…