Classic Transformer Models in Practice: Training a Chinese T5 Model (Text-to-Text Transfer Transformer) from Scratch

scient

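scient is a Python package that implements algorithms for scientific computing, with modules for natural language, images, neural networks, optimization algorithms, machine learning, graph computation, and more.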

The source code and binary installers for the latest released version are available at the Python Package Index:

https://pypi.org/project/scient

You can install scient with pip:

pip install scient

Alternatively, from the scient source directory, run:

python setup.py install
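
To confirm which version ended up installed, a minimal check using only the standard library might look like this. The helper name installed_version is illustrative and is not part of scient.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None if scient has not been installed in this environment.
    print("scient version:", installed_version("scient"))
```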

scient.neuralnet

Neural network algorithms, including attention, transformer, bert, lstm, resnet, crf, dataset, fit, and others.

scient.neuralnet.transformer

Implements several Transformer models: Transformer, T5Transformer, ViTransformer, DecodeTransformer, Encoder, and Decoder.

scient.neuralnet.transformer.T5Transformer(vocab_size: int, seq_len: int = 512, embed_size: int = 768,
										   n_head: int = 12, n_encode_layer: int = 12, n_decode_layer: int = 12, n_bucket: int = 32,
										   max_dist: int = 128, norm_first: bool = True, bias: bool = False, attn_scale: bool = False,
										   **kwargs)

Parameters

vocab_size : int
Vocabulary size.
seq_len : int, optional
Sequence length. The default is 512.
embed_size : int, optional
Length of the embedding vectors. The default is 768.
n_head : int, optional
Number of heads in multi_head_attention. The default is 12.
n_encode_layer : int, optional
Number of encoder layers. The default is 12.
n_decode_layer : int, optional
Number of decoder layers. The default is 12.
n_bucket : int, optional
Number of buckets used by the relative position encoding in multi_head_attention. The default is 32.
max_dist : int, optional
Maximum distance used by the relative position encoding in multi_head_attention. The default is 128.
norm_first : bool, optional
Whether layer normalization is applied first (pre-norm) in each encoder/decoder layer. The default is True.
bias : bool, optional
Whether the layers in the model use bias terms. The default is False.
attn_scale : bool, optional
Whether the attention matrix in multi_head_attention is scaled. The default is False.
kwargs : other parameters; everything in kwargs is forwarded to the Encoder and Decoder layers.

Algorithms

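T5 handles position encoding with relative position buckets (relative_position_bucket).
In the encoder, which uses bidirectional attention, the bucketing formula is:

[Image: relative position bucket formula for the bidirectional encoder]

In the decoder, which uses unidirectional attention, the bucketing formula is:

[Image: relative position bucket formula for the unidirectional decoder]

Here $n_b$ is the number of relative position buckets, n_bucket, and $max\_distance$ is the maximum distance of the relative position encoding, max_dist.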
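
The bucketing rule can be sketched in scalar form as follows. This follows the scheme commonly used in T5 implementations: exact buckets for small distances, logarithmically spaced buckets up to max_dist, and a single last bucket beyond that. It assumes the relative position is the key position minus the query position. The function name and the rounding details are assumptions, and scient's BucketPosition may differ.

```python
import math


def relative_position_bucket(relative_position, bidirectional=True, n_bucket=32, max_dist=128):
    """Map a signed relative position (key index - query index) to a bucket id."""
    bucket = 0
    if bidirectional:
        # Encoder: half of the buckets for keys after the query, half for keys before it.
        n_bucket //= 2
        if relative_position > 0:
            bucket += n_bucket
        distance = abs(relative_position)
    else:
        # Decoder: keys after the query are masked anyway, so they collapse to distance 0.
        distance = max(-relative_position, 0)

    max_exact = n_bucket // 2
    if distance < max_exact:
        # Small distances each get their own bucket.
        return bucket + distance

    # Larger distances share logarithmically spaced buckets up to max_dist.
    log_bucket = max_exact + int(
        math.log(distance / max_exact) / math.log(max_dist / max_exact) * (n_bucket - max_exact)
    )
    return bucket + min(log_bucket, n_bucket - 1)


if __name__ == "__main__":
    for rel in (-200, -128, -8, -1, 0, 1, 8, 200):
        print(rel, relative_position_bucket(rel), relative_position_bucket(rel, bidirectional=False))
```

Each bucket id indexes one row of a small embedding table, which is consistent with the Embedding(32, 12) modules in the printed model: 32 buckets, one learned bias per attention head.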

T5 model structure

T5Transformer(
  (encoder_position): BucketPosition(
    (projection): Embedding(32, 12)
  )
  (decoder_position): BucketPosition(
    (projection): Embedding(32, 12)
  )
  (embedding): Embedding(32128, 768)
  (encoder): ModuleList(
    (0-11): 12 x Encoder(
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=False)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=3072, out_features=768, bias=False)
      )
      (layernorm1): T5LayerNorm()
      (layernorm2): T5LayerNorm()
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
  (decoder): ModuleList(
    (0-11): 12 x Decoder(
      (mask_multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=False)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=3072, out_features=768, bias=False)
      )
      (layernorm1): T5LayerNorm()
      (layernorm2): T5LayerNorm()
      (layernorm3): T5LayerNorm()
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
      (dropout3): Dropout(p=0.1, inplace=False)
    )
  )
  (encoder_layernorm): T5LayerNorm()
  (decoder_layernorm): T5LayerNorm()
  (linear): Linear(in_features=768, out_features=32128, bias=False)
)
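
As a rough sanity check, the parameter count implied by the printed structure can be worked out by hand. The sketch below assumes that each T5LayerNorm holds a single weight vector of length embed_size (the printout does not show this) and that the output linear layer does not share weights with the embedding. Both are assumptions.

```python
embed, ffn, vocab = 768, 3072, 32128
n_bucket, n_head, n_layer = 32, 12, 12

attention = 4 * embed * embed   # query, key, value and output projections, no bias
feedforward = 2 * embed * ffn   # two linear layers, no bias
layernorm = embed               # assumed: one weight vector per T5LayerNorm

encoder_layer = attention + feedforward + 2 * layernorm
decoder_layer = 2 * attention + feedforward + 3 * layernorm

total = (
    2 * n_bucket * n_head        # encoder_position and decoder_position
    + vocab * embed              # token embedding
    + n_layer * encoder_layer
    + n_layer * decoder_layer
    + 2 * layernorm              # encoder_layernorm and decoder_layernorm
    + embed * vocab              # output linear layer (assumed untied)
)
print(f"{total:,} parameters")
```

Under these assumptions the model has roughly 248 million parameters, which helps explain why each training iteration in the log further down takes several minutes.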

Examples

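The following code trains a model that rewrites a sentence without changing its meaning. For example, “鹿跳过篱笆。” (The deer jumped over the fence.) can be rewritten as “一只鹿跳过篱笆。” (A deer jumped over the fence.).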

import torch
from scient.neuralnet import transformer,fit
from scient.neuralnet import dataset
import sentencepiece
import pandas
from tqdm import tqdm

tqdm.pandas()

data_path='d:\\rewrite_train3.xlsx'
tokenizer_path='d:\\spiece.model'

#%%model
vocab_size=32128
seq_len_upper=32

tokenizer=sentencepiece.SentencePieceProcessor(tokenizer_path)
model=transformer.T5Transformer(vocab_size=vocab_size,dropout=0.1,ffn_size=3072)

#%% data
data=pandas.read_excel(data_path)

#tokenize
data['source_token']=data['input'].progress_apply(lambda x:tokenizer.encode(x))
data['target_token']=data['label'].progress_apply(lambda x:tokenizer.encode(x))

#cleaning: drop rows whose source or target is too long
data=data[(data['source_token'].apply(len)<seq_len_upper)&(data['target_token'].apply(len)<seq_len_upper)]

#truncate
data['source_token']=data['source_token'].progress_apply(lambda x:x[:seq_len_upper]+[tokenizer.eos_id()])#append the <eos> marker
data['target_input_token']=data['target_token'].progress_apply(lambda x:[tokenizer.pad_id()]+x[:seq_len_upper])#prepend the <bos> marker; pad_id is used as <bos> here
data['target_output_token']=data['target_token'].progress_apply(lambda x:x[:seq_len_upper]+[tokenizer.eos_id()])#append the <eos> marker

#mask
data['source_pad_mask']=data['source_token'].progress_apply(lambda x:[False]*len(x)+[True]*(seq_len_upper-len(x)))
data['target_pad_mask']=data['target_input_token'].progress_apply(lambda x:[False]*len(x)+[True]*(seq_len_upper-len(x)))

#pad to a fixed length
data['source_token']=data['source_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_input_token']=data['target_input_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_output_token']=data['target_output_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))

batch_size=8
#dataLoad
data_train=data.sample(frac=0.7)
data_eval=data.drop(data_train.index).sample(frac=0.7)
data_val=data.drop(data_train.index).drop(data_eval.index)
train_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_train,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=batch_size,shuffle=True)
eval_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_eval,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=batch_size,shuffle=False)
val_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_val,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=1,shuffle=False)
#%% training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

#loss function; targets equal to 0 (the pad id assumed here) are ignored
loss_func_ = torch.nn.CrossEntropyLoss(ignore_index=0)
def loss_func(y_hat,y):
    return loss_func_(y_hat.reshape(-1, vocab_size),y.reshape(-1).to(torch.int64).to(device))  # compute the loss

#optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

def perform_func(y_hat,y):#perform_func takes the predictions y_hat and the ground truth y
    y_hat,y=torch.concat(y_hat).reshape(-1, vocab_size).numpy(),torch.concat(y).reshape(-1).numpy()#concat y_hat and y first: they are computed and collected batch by batch from the loader, so each is a list of batch-sized tensors
    y_hat=y_hat.argmax(axis=1)
    y_hat=y_hat[y!=0]
    y=y[y!=0]
    return round((y_hat==y).sum()/len(y),4)#return the accuracy, rounded to 4 decimal places

model=fit.set(model,optimizer=optimizer,loss_func=loss_func,perform_func=perform_func,device=device,n_iter=5)
model.fit(train_loader,eval_loader,mode=('inputs','target'))
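
To make the preprocessing concrete, here is a small pure-Python sketch of the same steps on a toy pair of token sequences. The ids PAD_ID = 0 and EOS_ID = 1 and the helper build_example are assumptions for illustration; check tokenizer.pad_id() and tokenizer.eos_id() for the tokenizer you actually use.

```python
PAD_ID = 0  # assumed pad id, also used as <bos> for the decoder input
EOS_ID = 1  # assumed <eos> id


def build_example(source_ids, target_ids, max_len):
    """Build padded encoder/decoder inputs, padding masks and the shifted target."""
    if len(source_ids) >= max_len or len(target_ids) >= max_len:
        raise ValueError("sequence too long; the article drops such rows before this step")

    source = source_ids + [EOS_ID]
    target_input = [PAD_ID] + target_ids   # decoder input, shifted right by one
    target_output = target_ids + [EOS_ID]  # what the decoder should predict at each step

    def pad_mask(ids):
        # True marks padding positions that attention should ignore.
        return [False] * len(ids) + [True] * (max_len - len(ids))

    def pad(ids):
        return ids + [PAD_ID] * (max_len - len(ids))

    return {
        "source_token": pad(source),
        "source_pad_mask": pad_mask(source),
        "target_input_token": pad(target_input),
        "target_pad_mask": pad_mask(target_input),
        "target_output_token": pad(target_output),
    }


example = build_example([5, 6, 7], [8, 9], max_len=6)
for name, value in example.items():
    print(name, value)
```

At position i the decoder sees target_input_token[:i+1] and is trained to predict target_output_token[i], which is why the two sequences are offset by one token.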

The tokenizer model spiece.model and the training data rewrite_train3.xlsx used in the code can be downloaded here:
Link: https://pan.baidu.com/s/12vEZBYldXvPrJTiFUEKGUw?pwd=DTFM
Extraction code: DTFM

After 5 training iterations, the model's accuracy exceeds 99% on both the training set and the evaluation set.

train iter 0: avg_batch_loss=3.88477 perform=0.5023: 100%|██████████| 140/140 [06:43<00:00,  2.88s/it]    
eval iter 0: avg_batch_loss=0.56695 perform=0.8973: 100%|██████████| 42/42 [00:28<00:00,  1.47it/s]    
train iter 1: avg_batch_loss=0.27674 perform=0.9539: 100%|██████████| 140/140 [08:02<00:00,  3.45s/it]    
eval iter 1: avg_batch_loss=0.08557 perform=0.9808: 100%|██████████| 42/42 [00:46<00:00,  1.10s/it]    
train iter 2: avg_batch_loss=0.05592 perform=0.9897: 100%|██████████| 140/140 [09:33<00:00,  4.10s/it]    
eval iter 2: avg_batch_loss=0.01999 perform=0.9957: 100%|██████████| 42/42 [00:28<00:00,  1.45it/s]    
train iter 3: avg_batch_loss=0.02244 perform=0.9964: 100%|██████████| 140/140 [07:58<00:00,  3.42s/it]    
eval iter 3: avg_batch_loss=0.01343 perform=0.996: 100%|██████████| 42/42 [00:32<00:00,  1.31it/s]     
train iter 4: avg_batch_loss=0.01273 perform=0.9981: 100%|██████████| 140/140 [07:44<00:00,  3.32s/it]    
eval iter 4: avg_batch_loss=0.01047 perform=0.9977: 100%|██████████| 42/42 [00:29<00:00,  1.41it/s]
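
The perform value in the log is a token-level accuracy that skips padding positions, mirroring perform_func above. A minimal pure-Python sketch of that metric, assuming pad id 0 and predictions that have already been reduced to token ids (the name masked_accuracy is illustrative):

```python
def masked_accuracy(pred_ids, target_ids, pad_id=0):
    """Share of non-padding target positions where the predicted id matches."""
    kept = [(p, t) for p, t in zip(pred_ids, target_ids) if t != pad_id]
    if not kept:
        return 0.0
    correct = sum(p == t for p, t in kept)
    return round(correct / len(kept), 4)


pred = [8, 9, 1, 4, 4, 4]    # whatever the model predicts at padded positions is ignored
target = [8, 3, 1, 0, 0, 0]  # three real tokens followed by padding
print(masked_accuracy(pred, target))
```

Because this is measured per token with the correct previous tokens fed to the decoder, it tends to be higher than the share of sentences that are reproduced exactly during free-running generation.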

Use the trained model to generate predictions for the data_val dataset:

#%%
# validation
model.eval()
progressbar = tqdm(val_loader)#batch_size must be 1 here
preds=[]
with torch.no_grad():
    for index,((source,target_input,source_pad_mask,target_input_pad_mask),target_output) in enumerate(progressbar):
        # break
        memory=model.encode(source.to(torch.int64).to(device),source_pad_mask.to(device))
        pred=torch.tensor([[tokenizer.pad_id()]])#bos
        while True:
            pred_mask=torch.zeros_like(pred).to(torch.bool)
            decode = model.decode(pred.to(torch.int64).to(device),memory,target_pad_mask=pred_mask.to(device))
            output=model.linear(decode)
            _,ids = output.max(dim=-1)
            if ids[0,-1]==tokenizer.eos_id():#eos
                break
            if pred.size(1)>seq_len_upper-1:
                break
            pred=torch.cat([pred.to(device),ids[:,-1:]],dim=-1)
        preds+=pred.tolist()

data_val['target_output_pred']=preds
data_val['target_pred']=data_val['target_output_pred'].progress_apply(lambda x:tokenizer.decode(x))
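
The loop above is greedy decoding: at each step the highest-scoring next token is appended until <eos> is produced or the length limit is reached. The control flow can be sketched without a real model. Here next_token_scores is a hypothetical stand-in for model.decode followed by model.linear, and the token ids are assumed.

```python
PAD_ID, EOS_ID = 0, 1  # assumed ids; the pad id doubles as <bos>


def greedy_decode(next_token_scores, max_len):
    """Repeatedly append the highest-scoring token until <eos> or max_len tokens."""
    tokens = [PAD_ID]
    while True:
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        if best == EOS_ID:
            break
        if len(tokens) > max_len - 1:
            break
        tokens.append(best)
    return tokens


def toy_scores(tokens):
    """Toy scorer over a 5-token vocabulary that always continues the script 4, 2, 3, <eos>."""
    script = [4, 2, 3, EOS_ID]
    wanted = script[min(len(tokens) - 1, len(script) - 1)]
    return [1.0 if i == wanted else 0.0 for i in range(5)]


print(greedy_decode(toy_scores, max_len=32))
```

Greedy decoding is the simplest choice and matches the loop in the article. Beam search or sampling would change only how the next token is picked from the scores.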

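Prediction results

[Image: table of prediction results with the columns input, label and target_pred]

input is the model input, label is the output the model is expected to produce, and target_pred is what the model actually produced. The model output largely matches the expected output.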
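The first record is worth a closer look. The model rewrote
机构认为，随着经济数据及上市公司财报的披露，预计市场主线将逐渐清晰，中大盘成长股有望成为下一阶段的资金偏好。
(Institutions believe that as economic data and listed companies' financial reports are disclosed, the main market theme should gradually become clear, and mid- and large-cap growth stocks are likely to attract capital in the next phase.)
as
他们说,第二季度的收益报告将是给予投资者这种指导的关键。
(They said the second-quarter earnings reports will be key to giving investors this kind of guidance.)
Although this differs considerably from the expected output, the meaning of the model's output is still entirely correct. Could this be creativity emerging from the model?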