scient
scient is a Python package of algorithms for scientific computing, with modules for natural language, image, neural network, optimization, machine learning, graph computing, and more.
The source code and binary installers for the latest released version are available at the [Python package index](https://pypi.org/project/scient).
You can install scient with pip:
pip install scient
It can also be installed from source: in the scient directory, execute
python setup.py install
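To confirm that the installation succeeded, you can query pip for the package metadata (a standard pip command, nothing scient-specific):
pip show scient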
scient.neuralnet
Neural-network algorithm module, including attention, transformer, bert, lstm, resnet, crf, dataset, fit, etc.
scient.neuralnet.transformer
Implements several Transformer models, including Transformer, T5Transformer, ViTransformer, DecodeTransformer, Encoder, and Decoder.
scient.neuralnet.transformer.T5Transformer(vocab_size: int, seq_len: int = 512, embed_size: int = 768,
n_head: int = 12, n_encode_layer: int = 12, n_decode_layer: int = 12, n_bucket: int = 32,
max_dist: int = 128, norm_first: bool = True, bias: bool = False, attn_scale: bool = False,
**kwargs)
Parameters
- vocab_size : int
  Vocabulary size.
- seq_len : int, optional
  Sequence length. The default is 512.
- embed_size : int, optional
  Embedding vector size. The default is 768.
- n_head : int, optional
  Number of heads in multi_head_attention. The default is 12.
- n_encode_layer : int, optional
  Number of encoder layers. The default is 12.
- n_decode_layer : int, optional
  Number of decoder layers. The default is 12.
- n_bucket : int, optional
  Number of buckets for relative position encoding in multi_head_attention. The default is 32.
- max_dist : int, optional
  Maximum distance for relative position encoding in multi_head_attention. The default is 128.
- norm_first : bool, optional
  Whether normalization (T5LayerNorm) is applied before the attention/feedforward sublayers in each encoder/decoder layer. The default is True.
- bias : bool, optional
  Whether the model's linear layers use a bias term. The default is False.
- attn_scale : bool, optional
  Whether the attention matrix in multi_head_attention is scaled. The default is False.
- kwargs : additional keyword arguments, passed through to the Encoder and Decoder layers.
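As a quick illustration, a model with the default configuration can be built and inspected as follows. This is a minimal sketch that uses only the constructor documented above; the parameter count is computed with standard PyTorch, since the model is an nn.Module.
from scient.neuralnet import transformer

model = transformer.T5Transformer(vocab_size=32128)
print(sum(p.numel() for p in model.parameters()))  # total number of model parameters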
Algorithms
T5 handles positional information with relative position bucketing (relative_position_bucket).
In the Encoder stage, where attention is bidirectional, the relative distance $d$ between a key position and a query position is mapped to a bucket by

$$\mathrm{bucket}(d)=\mathbb{1}[d>0]\cdot\frac{n_b}{2}+\begin{cases}|d|, & |d|<\frac{n_b}{4}\\ \min\left(\frac{n_b}{4}+\left\lfloor\frac{\log\big(|d|/(n_b/4)\big)}{\log\big(max\_distance/(n_b/4)\big)}\cdot\frac{n_b}{4}\right\rfloor,\ \frac{n_b}{2}-1\right), & \text{otherwise}\end{cases}$$

In the Decoder stage, where attention is unidirectional, distances to future positions are first clipped to zero, $d'=-\min(d,0)$, and the bucket is

$$\mathrm{bucket}(d)=\begin{cases}d', & d'<\frac{n_b}{2}\\ \min\left(\frac{n_b}{2}+\left\lfloor\frac{\log\big(d'/(n_b/2)\big)}{\log\big(max\_distance/(n_b/2)\big)}\cdot\frac{n_b}{2}\right\rfloor,\ n_b-1\right), & \text{otherwise}\end{cases}$$

where $n_b$ is the number of buckets for relative position encoding (n_bucket) and $max\_distance$ is the maximum relative distance (max_dist).
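For reference, the same bucketing rule can be written out in PyTorch as below. This is a sketch of the standard T5 formulation, not scient's own code; the function name and internal details of scient's implementation may differ.
import math
import torch

def relative_position_bucket(relative_position, bidirectional=True, n_bucket=32, max_dist=128):
    # relative_position: long tensor of (key position - query position)
    buckets = torch.zeros_like(relative_position)
    n = n_bucket
    if bidirectional:
        n //= 2
        buckets += (relative_position > 0).long() * n  # half of the buckets encode the sign
        relative_position = relative_position.abs()
    else:
        relative_position = -torch.clamp(relative_position, max=0)  # causal: only look backwards
    max_exact = n // 2
    is_small = relative_position < max_exact  # small distances each get their own bucket
    # larger distances are assigned logarithmically spaced buckets up to max_dist
    large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_dist / max_exact) * (n - max_exact)
    ).long()
    large = torch.clamp(large, max=n - 1)
    return buckets + torch.where(is_small, relative_position, large)

# example: bucket indices for relative distances -5..5 in the bidirectional (encoder) case
print(relative_position_bucket(torch.arange(-5, 6), bidirectional=True))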
T5 model structure
T5Transformer(
(encoder_position): BucketPosition(
(projection): Embedding(32, 12)
)
(decoder_position): BucketPosition(
(projection): Embedding(32, 12)
)
(embedding): Embedding(32128, 768)
(encoder): ModuleList(
(0-11): 12 x Encoder(
(multi_head_attn): MultiHead(
(dropout): Dropout(p=0.1, inplace=False)
(query): Linear(in_features=768, out_features=768, bias=False)
(key): Linear(in_features=768, out_features=768, bias=False)
(value): Linear(in_features=768, out_features=768, bias=False)
(linear): Linear(in_features=768, out_features=768, bias=False)
)
(feedforward): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=False)
(1): ReLU()
(2): Dropout(p=0.1, inplace=False)
(3): Linear(in_features=3072, out_features=768, bias=False)
)
(layernorm1): T5LayerNorm()
(layernorm2): T5LayerNorm()
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
)
(decoder): ModuleList(
(0-11): 12 x Decoder(
(mask_multi_head_attn): MultiHead(
(dropout): Dropout(p=0.1, inplace=False)
(query): Linear(in_features=768, out_features=768, bias=False)
(key): Linear(in_features=768, out_features=768, bias=False)
(value): Linear(in_features=768, out_features=768, bias=False)
(linear): Linear(in_features=768, out_features=768, bias=False)
)
(multi_head_attn): MultiHead(
(dropout): Dropout(p=0.1, inplace=False)
(query): Linear(in_features=768, out_features=768, bias=False)
(key): Linear(in_features=768, out_features=768, bias=False)
(value): Linear(in_features=768, out_features=768, bias=False)
(linear): Linear(in_features=768, out_features=768, bias=False)
)
(feedforward): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=False)
(1): ReLU()
(2): Dropout(p=0.1, inplace=False)
(3): Linear(in_features=3072, out_features=768, bias=False)
)
(layernorm1): T5LayerNorm()
(layernorm2): T5LayerNorm()
(layernorm3): T5LayerNorm()
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
(encoder_layernorm): T5LayerNorm()
(decoder_layernorm): T5LayerNorm()
(linear): Linear(in_features=768, out_features=32128, bias=False)
)
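The printout above is the standard PyTorch module repr. Assuming dropout and ffn_size are forwarded to the Encoder/Decoder layers through kwargs, as in the training example below, it can be reproduced by printing an instance configured the same way:
from scient.neuralnet import transformer
print(transformer.T5Transformer(vocab_size=32128, dropout=0.1, ffn_size=3072))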
Examples
The following example trains a model that rewrites a sentence without changing its meaning; for instance, "鹿跳过篱笆。" (a deer jumps over the fence) can be rewritten as "一只鹿跳过篱笆。" (one deer jumps over the fence).
import torch
from scient.neuralnet import transformer,fit
from scient.neuralnet import dataset
import sentencepiece
import pandas
from tqdm import tqdm
tqdm.pandas()
data_path='d:\\rewrite_train3.xlsx'
tokenizer_path='d:\\spiece.model'
#%%model
vocab_size=32128
seq_len_upper=32
tokenizer=sentencepiece.SentencePieceProcessor(tokenizer_path)
model=transformer.T5Transformer(vocab_size=vocab_size,dropout=0.1,ffn_size=3072)
#%% data
data=pandas.read_excel(data_path)
#tokenize
data['source_token']=data['input'].progress_apply(lambda x:tokenizer.encode(x))
data['target_token']=data['label'].progress_apply(lambda x:tokenizer.encode(x))
#clean: keep only samples shorter than the sequence limit
data=data[(data['source_token'].apply(len)<seq_len_upper)&(data['target_token'].apply(len)<seq_len_upper)]
#truncate
data['source_token']=data['source_token'].progress_apply(lambda x:x[:seq_len_upper]+[tokenizer.eos_id()])#append the <eos> token
data['target_input_token']=data['target_token'].progress_apply(lambda x:[tokenizer.pad_id()]+x[:seq_len_upper])#prepend a <bos> token; pad_id is used as <bos> here
data['target_output_token']=data['target_token'].progress_apply(lambda x:x[:seq_len_upper]+[tokenizer.eos_id()])#append the <eos> token
#mask
data['source_pad_mask']=data['source_token'].progress_apply(lambda x:[False]*len(x)+[True]*(seq_len_upper-len(x)))
data['target_pad_mask']=data['target_input_token'].progress_apply(lambda x:[False]*len(x)+[True]*(seq_len_upper-len(x)))
#pad to seq_len_upper
data['source_token']=data['source_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_input_token']=data['target_input_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_output_token']=data['target_output_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
batch_size=8
#dataLoad
data_train=data.sample(frac=0.7)
data_eval=data.drop(data_train.index).sample(frac=0.7)
data_val=data.drop(data_train.index).drop(data_eval.index)
train_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_train,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=batch_size,shuffle=True)
eval_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_eval,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=batch_size,shuffle=False)
val_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_val,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=1,shuffle=False)
#%% training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")
#loss function
loss_func_ = torch.nn.CrossEntropyLoss(ignore_index=0)
def loss_func(y_hat,y):
    return loss_func_(y_hat.reshape(-1, vocab_size),y.reshape(-1).to(torch.int64).to(device)) # compute the loss
#optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)
def perform_func(y_hat,y):#perform_func receives the predictions y_hat and the ground truth y
    y_hat,y=torch.concat(y_hat).reshape(-1, vocab_size).numpy(),torch.concat(y).reshape(-1).numpy()#y_hat and y are collected batch by batch from the loader, so each is a list of batch-sized tensors; concatenate them first
    y_hat=y_hat.argmax(axis=1)
    y_hat=y_hat[y!=0]#ignore padding positions (pad_id is 0)
    y=y[y!=0]
    return round((y_hat==y).sum()/len(y),4)#accuracy, rounded to 4 decimal places
model=fit.set(model,optimizer=optimizer,loss_func=loss_func,perform_func=perform_func,device=device,n_iter=5)
model.fit(train_loader,eval_loader,mode=('inputs','target'))
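After training you may want to persist the weights with standard PyTorch serialization. A minimal sketch, assuming the object returned by fit.set is still the underlying nn.Module (the prediction code below suggests it is); the file path is illustrative:
torch.save(model.state_dict(), 'd:\\t5_rewrite.pt')
# later: model.load_state_dict(torch.load('d:\\t5_rewrite.pt', map_location=device))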
The tokenizer model spiece.model and the training data rewrite_train3.xlsx used in the code can be downloaded here:
Link: https://pan.baidu.com/s/12vEZBYldXvPrJTiFUEKGUw?pwd=DTFM
Extraction code: DTFM
After 5 training epochs, the model reaches over 99% accuracy on both the training and evaluation sets.
train iter 0: avg_batch_loss=3.88477 perform=0.5023: 100%|██████████| 140/140 [06:43<00:00, 2.88s/it]
eval iter 0: avg_batch_loss=0.56695 perform=0.8973: 100%|██████████| 42/42 [00:28<00:00, 1.47it/s]
train iter 1: avg_batch_loss=0.27674 perform=0.9539: 100%|██████████| 140/140 [08:02<00:00, 3.45s/it]
eval iter 1: avg_batch_loss=0.08557 perform=0.9808: 100%|██████████| 42/42 [00:46<00:00, 1.10s/it]
train iter 2: avg_batch_loss=0.05592 perform=0.9897: 100%|██████████| 140/140 [09:33<00:00, 4.10s/it]
eval iter 2: avg_batch_loss=0.01999 perform=0.9957: 100%|██████████| 42/42 [00:28<00:00, 1.45it/s]
train iter 3: avg_batch_loss=0.02244 perform=0.9964: 100%|██████████| 140/140 [07:58<00:00, 3.42s/it]
eval iter 3: avg_batch_loss=0.01343 perform=0.996: 100%|██████████| 42/42 [00:32<00:00, 1.31it/s]
train iter 4: avg_batch_loss=0.01273 perform=0.9981: 100%|██████████| 140/140 [07:44<00:00, 3.32s/it]
eval iter 4: avg_batch_loss=0.01047 perform=0.9977: 100%|██████████| 42/42 [00:29<00:00, 1.41it/s]
Use the trained model to generate predictions on the data_val dataset:
#%%
# validation
model.eval()
progressbar = tqdm(val_loader)#batch_size must be 1 here
preds=[]
with torch.no_grad():
    for index,((source,target_input,source_pad_mask,target_input_pad_mask),target_output) in enumerate(progressbar):
        # break
        memory=model.encode(source.to(torch.int64).to(device),source_pad_mask.to(device))
        pred=torch.tensor([[tokenizer.pad_id()]])#bos
        while True:
            pred_mask=torch.zeros_like(pred).to(torch.bool)
            decode = model.decode(pred.to(torch.int64).to(device),memory,target_pad_mask=pred_mask.to(device))
            output=model.linear(decode)
            _,ids = output.max(dim=-1)
            if ids[0,-1]==tokenizer.eos_id():#eos
                break
            if pred.size(1)>seq_len_upper-1:
                break
            pred=torch.cat([pred.to(device),ids[:,-1:]],dim=-1)
        preds+=pred.tolist()
data_val['target_output_pred']=preds
data_val['target_pred']=data_val['target_output_pred'].progress_apply(lambda x:tokenizer.decode(x))
Prediction results
In the results, input is the model input, label is the expected output, and target_pred is the model's actual output; the predictions are essentially consistent with the expectations.
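For example, the first few rows can be compared side by side using the columns created above:
print(data_val[['input','label','target_pred']].head())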
A noteworthy case is the first record, where the model rewrote
机构认为,随着经济数据及上市公司财报的披露,预计市场主线将逐渐清晰,中大盘成长股有望成为下一阶段的资金偏好。
(roughly: institutions believe that as economic data and listed companies' earnings reports are released, the market's main theme will gradually become clear, and mid- and large-cap growth stocks are likely to attract capital in the next stage)
as
他们说,第二季度的收益报告将是给予投资者这种指导的关键。
(roughly: they say that second-quarter earnings reports will be key to giving investors this guidance).
Although this differs considerably from the expected output, the meaning of the model's output is still entirely correct; could this be the model's emergent creativity?