Text Classification with SentencePiece

Preface

Step 1: Background

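SentencePiece is an unsupervised text tokenizer and detokenizer (the detokenizer turns tokens back into the original text).
It is mainly intended for text generation systems where the vocabulary size is fixed in advance.
It implements subword units such as BPE and the unigram language model, extended so that they can be trained directly from raw sentences.
Technical highlights
- Purely data-driven: the tokenizer and detokenizer are trained from sentences alone. A separate pre-tokenization step is not always required.
- Language independent: sentences are treated as sequences of Unicode characters, with no language-specific logic.
- Multiple subword algorithms: BPE and unigram LM.
- Subword regularization: subword sampling is implemented for both subword regularization and BPE-dropout, which helps robustness and accuracy.
- Fast and lightweight: about 50k sentences per second, with a memory footprint of about 6 MB.
- Self-contained: the same model file always gives the same tokenization.
- Direct vocabulary id generation: id sequences can be produced straight from raw sentences.
- NFKC-based normalization. For reference, the four Unicode normalization forms are:
  - NFC: canonical composition; characters are normalized to a single precomposed character.
  - NFD: canonical decomposition; characters are normalized to a base character plus combining marks. For example, é becomes e + ´ under NFD.
  - NFKC: compatibility composition; like NFC, but normalization may also drop some formatting information.
  - NFKD: compatibility decomposition; like NFD, but normalization may also drop some formatting information.
**Gripe:** the Hugging Face tokenizers library can do all of this too, and more.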
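The four normalization forms listed above are easy to compare with Python's standard `unicodedata` module. This is only an illustration of the Unicode forms themselves, not of SentencePiece's own normalizer; the helper name `code_points` is made up for this sketch.

```python
import unicodedata

def code_points(form: str, text: str) -> list[str]:
    """Normalize `text` with the given form and list the resulting code points."""
    normalized = unicodedata.normalize(form, text)
    return [f"U+{ord(ch):04X}" for ch in normalized]

composed = "\u00e9"   # é as a single precomposed character
ligature = "\ufb01"   # the "fi" ligature, a compatibility character

print(code_points("NFC", composed))   # ['U+00E9']
print(code_points("NFD", composed))   # ['U+0065', 'U+0301']  (e + combining acute)
print(code_points("NFC", ligature))   # ['U+FB01']  (canonical forms keep the ligature)
print(code_points("NFKC", ligature))  # ['U+0066', 'U+0069']  (plain "f" + "i")
```

The last two lines show the "formatting information" that the compatibility forms throw away: the ligature survives NFC but is flattened to two ordinary letters by NFKC.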

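1. What is SentencePiece

It is a re-implementation of subword units, a way to alleviate the open-vocabulary problem.
The number of unique tokens is fixed in advance, for example 8k, 16k, or 32k.
Training from raw sentences:
- Earlier subword implementations needed the input sentences to be tokenized beforehand so that training would be efficient.
- SentencePiece is fast enough to train the model from raw sentences. This is useful for Chinese and Japanese.
Whitespace is treated as a basic symbol:
- Before: tokenizing "word." and "word ." gave the same result.
- Now: tokenizing "word." and "word ." gives different results, because the space is kept as a symbol ("word▁.").
- Because the whitespace is preserved in the segmented text, it can be detokenized without ambiguity; the earlier behaviour was not reversible.
- This is what makes language-independent tokenization possible.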
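A minimal sketch of why keeping whitespace as a symbol makes detokenization lossless. The regex-based `toy_segment` is a stand-in invented for this example, not SentencePiece's segmentation; the only part taken from SentencePiece is the use of "▁" (U+2581) as the whitespace marker and the dummy prefix at the start of the text.

```python
import re

WS = "\u2581"  # "▁", the marker that stands in for a space

def escape(text: str) -> str:
    """Prefix a marker and turn every space into the marker."""
    return WS + text.replace(" ", WS)

def toy_segment(escaped: str) -> list[str]:
    """Hypothetical segmenter: words and full stops, markers kept on the pieces."""
    return re.findall(rf"{WS}?[^{WS}.]+|{WS}?\.|{WS}", escaped)

def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces, map markers back to spaces, drop the dummy prefix."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text

for raw in ["word.", "word ."]:
    pieces = toy_segment(escape(raw))
    print(repr(raw), pieces, repr(detokenize(pieces)))
```

The two inputs produce different piece sequences, and each one round-trips to exactly the string it came from, which is the property described above.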

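2. Subword regularization and BPE-dropout

Purpose: both are applied to subword segmentation during model training and aim to improve generalization and robustness.
Subword regularization:
- Principle: training does not stick to one fixed segmentation. Instead, one segmentation is sampled at random from several candidates, which helps the model cope with varied inputs.
- Advantages: introduces uncertainty into the segmentation, improving robustness and generalization. Helpful in low-resource settings where data is scarce.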
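The sampling idea can be sketched in a few lines of plain Python. Everything here is a toy: the piece log-probabilities in `LOGP` are invented, and the candidates are found by brute-force enumeration, which only works for very short words. The weighting by probability raised to a smoothing power `alpha` follows the usual description of subword regularization; it is not SentencePiece's actual code.

```python
import math
import random

# Hypothetical unigram log-probabilities; a real model learns these from data.
LOGP = {"h": -4.0, "e": -4.0, "l": -4.0, "o": -4.0,
        "he": -3.0, "ll": -3.0, "llo": -2.5, "hell": -2.5, "hello": -2.0}

def segmentations(word):
    """Yield every way to split `word` into pieces that exist in LOGP."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in LOGP:
            for rest in segmentations(word[end:]):
                yield [piece] + rest

def score(segmentation):
    """Log-probability of a segmentation under the toy unigram model."""
    return sum(LOGP[piece] for piece in segmentation)

def sample_segmentation(word, alpha, rng):
    """Sample one segmentation with weight P(segmentation) ** alpha."""
    candidates = list(segmentations(word))
    weights = [math.exp(alpha * score(seg)) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("hello", alpha=0.5, rng=rng))
```

A smaller `alpha` flattens the distribution so that unlikely segmentations show up more often; a large `alpha` makes the sampler behave almost like always picking the best segmentation.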
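BPE-dropout:
- Principle: ordinary BPE always merges the most frequent adjacent symbol pair, whereas BPE-dropout randomly skips some merge steps. As a result, the same word may be split into different subword sequences in different training iterations.
- Advantages: adds randomness, improving robustness and generalization. Helpful for the OOV problem.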
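A toy version of BPE-dropout, to make the "randomly skip merges" step concrete. The merge table `MERGES` is invented for the word "hello", and the function is a sketch of the general technique rather than SentencePiece's implementation: at every step each applicable merge is dropped with probability `dropout`, and the best-ranked surviving merge is applied.

```python
import random

# Hypothetical merge table in priority order (earlier entries were learned first).
MERGES = [("l", "l"), ("h", "e"), ("ll", "o"), ("he", "llo")]
RANK = {pair: rank for rank, pair in enumerate(MERGES)}

def bpe_encode(word, dropout, rng):
    """Segment `word` with BPE, skipping each candidate merge with prob. `dropout`."""
    symbols = list(word)
    while True:
        # Merges available in the current sequence that survive the dropout.
        candidates = [
            (RANK[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in RANK and rng.random() >= dropout
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # best-ranked merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

print(bpe_encode("hello", dropout=0.0, rng=random.Random(0)))  # ['hello']
for seed in range(3):
    print(bpe_encode("hello", dropout=0.3, rng=random.Random(seed)))
```

With `dropout=0.0` this is plain deterministic BPE; with `dropout=1.0` every merge is skipped and the word falls back to single characters.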

3. Installation

Install with pip

pip install sentencepiece

Build and install from C++ source

git clone https://github.com/google/sentencepiece.git 
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo update_dyld_shared_cache   # macOS
# sudo ldconfig -v              # on Ubuntu, run this instead

Step 2: Usage guide

1. Training a SentencePiece model

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

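--input: raw corpus file with one sentence per line. NFKC normalization is applied by default. A comma-separated list of files can be passed.
--model_prefix: prefix for the output model name. Generates xx.model and xx.vocab.
--vocab_size: vocabulary size, e.g. 8000, 16000, or 32000.
--character_coverage: fraction of characters covered by the model. A good default is 0.9995 for languages with a rich character set such as Chinese or Japanese; 1.0 works for small character sets.
--model_type: model type, one of unigram (default), bpe, char, or word.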
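To get a feel for what --character_coverage means, here is a rough sketch of one plausible reading: keep the most frequent characters until they account for the requested fraction of all character occurrences, and treat the rest as unknown. The function name and the exact tie-breaking are assumptions made for this illustration; SentencePiece's own selection logic may differ in detail.

```python
from collections import Counter

def covered_characters(sentences, coverage):
    """Return the most frequent characters that together reach `coverage`."""
    counts = Counter(ch for s in sentences for ch in s if not ch.isspace())
    total = sum(counts.values())
    kept, running = set(), 0
    for ch, n in counts.most_common():
        if running >= coverage * total:
            break  # the target fraction is already covered
        kept.add(ch)
        running += n
    return kept

corpus = ["aaaa", "aaab", "aaaz"]         # 'a' x10, 'b' x1, 'z' x1
print(covered_characters(corpus, 1.0))    # every character is kept
print(covered_characters(corpus, 0.8))    # only the dominant character is kept
```

This is why 0.9995 suits Chinese or Japanese: the long tail of very rare characters is cut off instead of each one taking a vocabulary slot.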

The rest of the flags:
--input (comma separated list of input sentences)  type: std::string default: ""
--input_format (Input format. Supported format is `text` or `tsv`.)  type: std::string default: ""
--model_prefix (output model prefix)  type: std::string default: ""
--model_type (model algorithm: unigram, bpe, word or char)  type: std::string default: "unigram"
--vocab_size (vocabulary size)  type: int32 default: 8000
--accept_language (comma-separated list of languages this model can accept)  type: std::string default: ""
--self_test_sample_size (the size of self test samples)  type: int32 default: 0
--character_coverage (character coverage to determine the minimum symbols)  type: double default: 0.9995
--input_sentence_size (maximum size of sentences the trainer loads)  type: std::uint64_t default: 0
--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0)  type: bool default: true
--seed_sentencepiece_size (the size of seed sentencepieces)  type: int32 default: 1000000
--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss)  type: double default: 0.75
--num_threads (number of threads for training)  type: int32 default: 16
--num_sub_iterations (number of EM sub-iterations)  type: int32 default: 2
--max_sentencepiece_length (maximum length of sentence piece)  type: int32 default: 16
--max_sentence_length (maximum length of sentence in byte)  type: int32 default: 4192
--split_by_unicode_script (use Unicode script to split sentence pieces)  type: bool default: true
--split_by_number (split tokens by numbers (0-9))  type: bool default: true
--split_by_whitespace (use a white space to split sentence pieces)  type: bool default: true
--split_digits (split all digits (0-9) into separate pieces)  type: bool default: false
--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.)  type: bool default: false
--allow_whitespace_only_pieces (allow pieces that only contain (consecutive) whitespace tokens)  type: bool default: false
--control_symbols (comma separated list of control symbols)  type: std::string default: ""
--control_symbols_file (load control_symbols from file.)  type: std::string default: ""
--user_defined_symbols (comma separated list of user defined symbols)  type: std::string default: ""
--user_defined_symbols_file (load user_defined_symbols from file.)  type: std::string default: ""
--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage)  type: std::string default: ""
--required_chars_file (load required_chars from file.)  type: std::string default: ""
--byte_fallback (decompose unknown pieces into UTF-8 byte pieces)  type: bool default: false
--vocabulary_output_piece_score (Define score in vocab file)  type: bool default: true
--normalization_rule_name (Normalization rule name. Choose from nfkc or identity)  type: std::string default: "nmt_nfkc"
--normalization_rule_tsv (Normalization rule TSV file. )  type: std::string default: ""
--denormalization_rule_tsv (Denormalization rule TSV file.)  type: std::string default: ""
--add_dummy_prefix (Add dummy whitespace at the beginning of text)  type: bool default: true
--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace)  type: bool default: true
--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.)  type: bool default: true
--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.)  type: bool default: false
--unk_id (Override UNK (<unk>) id.)  type: int32 default: 0
--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.)  type: int32 default: 1
--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.)  type: int32 default: 2
--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.)  type: int32 default: -1
--unk_piece (Override UNK (<unk>) piece.)  type: std::string default: "<unk>"
--bos_piece (Override BOS (<s>) piece.)  type: std::string default: "<s>"
--eos_piece (Override EOS (</s>) piece.)  type: std::string default: "</s>"
--pad_piece (Override PAD (<pad>) piece.)  type: std::string default: "<pad>"
--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.)  type: std::string default: " ⁇ "
--train_extremely_large_corpus (Increase bit depth for unigram tokenization.)  type: bool default: false
--random_seed (Seed value for random generator.)  type: uint32 default: 4294967295
--enable_differential_privacy (Whether to add DP while training. Currently supported only by UNIGRAM model.)  type: bool default: false
--differential_privacy_noise_level (Amount of noise to add for DP)  type: float default: 0
--differential_privacy_clipping_threshold (Threshold for clipping the counts for DP)  type: std::uint64_t default: 0
--help (show help)  type: bool default: false
--version (show version)  type: bool default: false
--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere)  type: int default: 0
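The same flags can also be passed as keyword arguments through the Python package installed with pip. A minimal sketch, assuming the `sentencepiece` package is installed: `build_train_kwargs` and `train` are helper names made up here, and the corpus path in the usage comment is a placeholder.

```python
def build_train_kwargs(corpus_path, model_prefix, vocab_size=8000,
                       character_coverage=0.9995, model_type="unigram"):
    """Collect the main training flags as keyword arguments."""
    if model_type not in {"unigram", "bpe", "char", "word"}:
        raise ValueError(f"unsupported model_type: {model_type}")
    return {
        "input": corpus_path,            # one sentence per line
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "character_coverage": character_coverage,
        "model_type": model_type,
    }

def train(**kwargs):
    """Run training; sentencepiece is imported lazily so the helper above works without it."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**kwargs)

# Usage (needs a real corpus file on disk):
# train(**build_train_kwargs("corpus.txt", "m", vocab_size=8000))
```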

2. Encoding raw text into sentence pieces/ids

spm_encode --model=<model_file> --output_format=piece < input > output
spm_encode --model=<model_file> --output_format=id < input > output

Use --extra_options to add BOS/EOS markers or to reverse the input sequence

spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

3. Decoding sentence pieces/ids

spm_decode --model=<model_file> --input_format=piece < input > output
spm_decode --model=<model_file> --input_format=id < input > output

Use --extra_options to decode the text in reverse order

spm_decode --extra_options=reverse < input > output

4. End-to-end example

spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
'''
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
'''

echo "I saw a girl with a telescope." | spm_encode --model=m.model
'''
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
'''

echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
'''
9 459 11 939 44 11 4 142 82 8 28 21 132 6
'''

echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
'''
I saw a girl with a telescope.
'''
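The same encode/decode round trip can be done from Python. A minimal sketch, assuming the `sentencepiece` package and a trained m.model as in the commands above; `load_processor` and `round_trip` are helper names invented for this example.

```python
def load_processor(model_path):
    """Load a trained model; sentencepiece is imported lazily."""
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=model_path)

def round_trip(sp, text):
    """Encode `text` to pieces and ids, then decode the ids back to text."""
    pieces = sp.encode(text, out_type=str)   # like --output_format=piece
    ids = sp.encode(text, out_type=int)      # like --output_format=id
    restored = sp.decode(ids)                # like spm_decode --input_format=id
    return pieces, ids, restored

# Usage (needs m.model from the training step):
# sp = load_processor("m.model")
# pieces, ids, restored = round_trip(sp, "I saw a girl with a telescope.")
```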

5. Exporting the vocabulary list

spm_export_vocab --model=<model_file> --output=<output file>
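The .vocab file written during training is plain text with one piece per line followed by a tab and a score, so it is easy to load from a script. The sketch below assumes that layout and that line order equals piece id; `read_vocab` is a made-up helper name and the sample lines are invented.

```python
def read_vocab(lines):
    """Return (id_to_piece, piece_to_id) from the lines of a .vocab file."""
    id_to_piece = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a trailing blank line
        piece = line.split("\t", 1)[0]  # drop the score column
        id_to_piece.append(piece)
    piece_to_id = {piece: i for i, piece in enumerate(id_to_piece)}
    return id_to_piece, piece_to_id

sample = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "\u2581the\t-3.1\n"]
id_to_piece, piece_to_id = read_vocab(sample)
print(id_to_piece)
print(piece_to_id["\u2581the"])

# Usage with a real file:
# with open("m.vocab", encoding="utf-8") as f:
#     id_to_piece, piece_to_id = read_vocab(f)
```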

6. Examples

notebook

Step 3: Experiment

1. Dataset

A hotel review dataset, preprocessed into one sentence per line

2. Usage

Training (I used a vocabulary size of 12,800)

This produces two files: a model file and a vocabulary file
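The vocabulary file looks like this

Tokenization
- Just tokenize directly; the task is classification, so there is no need to insert EOS and BOS
- Tokenize into ids
  - note: the order of entries in the generated vocabulary file is exactly the increasing order of the piece ids
There is no matching word-vector file, so the pieces still need embeddings trained for them; I'll just use fastText for that.
- Write a script to convert the data into the format fastText expects
With both the ids and the word vectors available, the embedding matrix can be built
- The mappings are
  - fast_vec -> word : vec
  - spm_vec -> id : word
  - build embedding -> id : vec
  - Of the roughly 12k (1.2w) entries, 20 rows are empty; that is not many, so fill the empty ones with random values
  - Write a script, then save the result as a .npy word-vector file for the model to use
- Approach
  - Use sentencepiece as the tokenizer to get a sequence of ids
  - Feed the ids to the model
  - Train the model
  - At inference time, tokenize with sentencepiece as well
- Let's put it into practice
  - Code
    Available from the resources section at the top of the post
  - Result: accuracy essentially converged at 96%
  - After epoch 30, fine-tuning for 10 more epochs together with the embedding layer raised accuracy by another percentage point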
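The id-to-vector mapping described above can be sketched as follows. This is not the script from the post: the function name, the toy inputs, and the random range of ±0.1 for missing pieces are all assumptions, and plain lists are used so the sketch has no third-party dependency.

```python
import random

def build_embedding_matrix(id_to_piece, piece_to_vec, dim, seed=0):
    """Row i holds the vector of piece i; pieces without a vector get random values."""
    rng = random.Random(seed)
    matrix, missing = [], 0
    for piece in id_to_piece:
        vec = piece_to_vec.get(piece)
        if vec is None:
            # Assumed fallback: small uniform noise for pieces fastText has no vector for.
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            missing += 1
        matrix.append(list(vec))
    return matrix, missing

# Toy inputs standing in for the spm id -> piece list and the fastText piece -> vec table.
id_to_piece = ["<unk>", "\u2581hotel", "\u2581good"]
piece_to_vec = {"\u2581hotel": [1.0, 2.0], "\u2581good": [3.0, 4.0]}

matrix, missing = build_embedding_matrix(id_to_piece, piece_to_vec, dim=2)
print(len(matrix), "rows,", missing, "filled with random values")
```

To get the .npy file mentioned above, the returned list can be converted with `numpy.asarray(matrix, dtype="float32")` and written with `numpy.save`.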