Language Models Based on Recurrent Neural Networks: RNNLM and GRULM

Language Model Based on Recurrent Neural Networks: RNNLM

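RNNLM was first proposed in "Recurrent neural network based language model", a highly influential neural language modeling paper published in 2010. Its main contributions are:

It was the first to propose and implement a language model based on a recurrent neural network (RNN), known as the RNN language model.
It introduces recurrent connections in the hidden layer to capture longer-range dependencies in the word sequence, giving the model stronger sequence modeling ability.
It removes NNLM's restriction to a fixed-length input context: an RNN accepts contexts of variable length.
It started a line of research that applies more powerful and more complex neural networks to language modeling, such as the later LSTM language models.
RNN-based language models were later widely used in practice and had a major impact.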

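RNNLM model structure

$x$ : input layer
$s$ : hidden / context / state layer
$y$ : output layer
$x_t$ : input at time $t$
$y_t$ : output at time $t$, the probability distribution over the next word
$s_t$ : state of the hidden layer at time $t$

The input vector $x_t$ is formed from the word vector $w_t$ at the current time step and the state vector $s_{t-1}$ from the previous time step:

$x_t = w_t + s_{t-1}$, where $+$ denotes concatenation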

The forward computation of the model is:

$s_t^j = f(\sum_i x_t^i u_{ji}) \quad \text{(1)}$

$y_t^k = g(\sum_j s_t^j v_{kj}) \quad \text{(2)}$

$f(z)$ is the sigmoid activation function:

$f(z) = \frac{1}{1 + e^{-z}} \quad \text{(3)}$

$g(z)$ is the softmax function:

$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}} \quad \text{(4)}$
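As a minimal sketch of equations (1) to (4), the following pure-Python snippet runs one forward step on toy sizes. The function name `rnnlm_step` and all weight values are made up for illustration; they do not come from the paper or from the PyTorch code later in this post.

```python
import math

def sigmoid(z):
    # equation (3)
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # equation (4); subtracting the maximum only improves numerical stability
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def rnnlm_step(w_t, s_prev, U, V):
    # x_t is the concatenation of the one-hot word vector and the previous state
    x_t = w_t + s_prev
    # equation (1): one sigmoid unit per row of U
    s_t = [sigmoid(sum(u * x for u, x in zip(row, x_t))) for row in U]
    # equation (2): softmax over one score per row of V
    y_t = softmax([sum(v * s for v, s in zip(row, s_t)) for row in V])
    return s_t, y_t

# toy sizes (assumed): vocabulary of 3 words, hidden layer of 2 units
w_t = [0.0, 1.0, 0.0]                 # one-hot vector of the current word
s_prev = [0.1, 0.1]                   # previous state, initialized to 0.1
U = [[0.2, -0.1, 0.4, 0.3, -0.5],
     [-0.3, 0.5, 0.1, 0.2, 0.1]]      # shape: hidden x (vocab + hidden)
V = [[0.1, 0.2], [-0.2, 0.3], [0.4, -0.1]]  # shape: vocab x hidden
s_t, y_t = rnnlm_step(w_t, s_prev, U, V)
```

Here `y_t` is a probability distribution over the 3 toy words, and `s_t` is fed back as `s_prev` at the next step.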

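Training details

$s_0$ : the initial state is set to a vector of small values such as 0.1; with a sufficiently large corpus the exact initialization does not matter.
$w_t$ : the word representation, a one-hot encoding; in practice its length (the vocabulary size) ranges from 30000 to 200000.
Size of the state layer: 30 to 500; experiments show that the larger the corpus, the larger the hidden layer should be.
Initial learning rate $\alpha = 0.1$ ; it is halved when the loss no longer decreases significantly.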
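The learning rate rule can be sketched as below. This is only an illustration: the threshold `min_improvement` and the function name `learning_rate_schedule` are assumptions, since the text only says the rate is halved when the loss stops decreasing significantly.

```python
def learning_rate_schedule(epoch_losses, initial_lr=0.1, min_improvement=0.01):
    """Return the learning rate used in each epoch.

    min_improvement is an assumed threshold for "no significant decrease".
    """
    lr = initial_lr
    rates = []
    previous = None
    for loss in epoch_losses:
        rates.append(lr)  # rate that was used for this epoch
        # halve the rate for the following epochs if the loss barely improved
        if previous is not None and previous - loss < min_improvement:
            lr = lr / 2
        previous = loss
    return rates

# made-up per-epoch losses, for illustration only
rates = learning_rate_schedule([5.0, 4.0, 3.995, 3.99, 3.5])
```

With these made-up losses the rate stays at 0.1 for the first three epochs and is halved before each of the last two.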

Error function

$Error_t = desired_t - y_t \quad \text{(5)}$

$desired_t$ : the one-hot encoding of the true next word at time $t$.
$y_t$ : the model's predicted output.
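A small worked instance of equation (5), with made-up numbers: the error vector is the one-hot target minus the predicted distribution. The helper name `output_error` is hypothetical. For a softmax output trained with cross-entropy, this vector is also the negative gradient of the loss with respect to the pre-softmax scores.

```python
def output_error(y_t, target_index):
    # desired_t: one-hot vector with a 1 at the index of the true next word
    desired = [1.0 if k == target_index else 0.0 for k in range(len(y_t))]
    # equation (5): element-wise difference
    return [d - y for d, y in zip(desired, y_t)]

y_t = [0.2, 0.5, 0.3]          # made-up prediction over a 3-word vocabulary
error = output_error(y_t, target_index=1)
```

The entry for the true word is positive (its probability should go up) and the other entries are negative; the entries sum to zero because both vectors sum to one.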

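Optimization

Preprocessing of the training corpus: all words whose frequency in the training text is below a threshold are merged into a single special rare token, and the probability mass of that token is shared uniformly among the rare words:

$P(w^i_{t+1}|w_t, s_{t-1}) = \begin{cases} \frac{y_t^{rare}}{C_{rare}} & \text{if } w^i_{t+1} \text{ is rare} \\ y_t^i & \text{otherwise} \end{cases}$

$C_{rare}$ : the number of vocabulary words whose frequency is below the threshold.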
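The rare-word formula can be illustrated with a toy vocabulary. Everything in this sketch (the word lists, the `<rare>` token name, the output values and the function name `next_word_probability`) is assumed for illustration.

```python
def next_word_probability(word, y_t, vocab_index, rare_words):
    if word in rare_words:
        # share the probability of the rare token uniformly among all rare words
        return y_t[vocab_index["<rare>"]] / len(rare_words)
    return y_t[vocab_index[word]]

vocab_index = {"good": 0, "food": 1, "<rare>": 2}       # output layer positions
rare_words = {"quinoa", "saffron", "tahini", "umami"}   # C_rare = 4
y_t = [0.5, 0.3, 0.2]                                   # made-up model output

p_rare = next_word_probability("quinoa", y_t, vocab_index, rare_words)
p_common = next_word_probability("good", y_t, vocab_index, rare_words)
total = (sum(next_word_probability(w, y_t, vocab_index, rare_words) for w in rare_words)
         + y_t[0] + y_t[1])
```

Each rare word receives 0.2 / 4, and the probabilities over all real words still sum to one, so the output layer stays small without breaking normalization.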

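Model implementation: PyTorch

The model does not reproduce RNNLM exactly; for example, words are represented with the now more common Embedding layer instead of one-hot vectors.
The recurrent connection is implemented in two variants, RNNCell and GRUCell.
GRUCell:
1. GRU has two gates: a reset gate and an update gate. The reset gate decides how much of the previous hidden state to forget, and the update gate decides how much of it to keep.
2. GRU carries a single hidden state vector per step, as a plain RNN does; it has no separate cell state as LSTM does.
3. Compared with LSTM, GRU is structurally simpler, involving only one hidden state vector and two gate vectors, so it needs less computation.
4. In experiments, GRU generally performs better than a plain RNN of the same configuration, especially on tasks with long sequences.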
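For reference, the gating can be written as the following equations, which mirror the GRUCell class in this post. The weight names $W_r$, $W_h$, $W_z$ and biases $b_r$, $b_h$, $b_z$ are labels chosen here for its three linear layers (xh_to_r, xh_to_hbar, xh_to_z); $[a; b]$ denotes concatenation and $\odot$ the element-wise product.

$r_t = \sigma(W_r [x_t; h_{t-1}] + b_r)$

$\bar{h}_t = \tanh(W_h [x_t; r_t \odot h_{t-1}] + b_h)$

$z_t = \sigma(W_z [x_t; h_{t-1}] + b_z)$

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \bar{h}_t$

When $z_t$ is close to 0 a channel simply copies its previous value, which is what lets information survive across many time steps.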

import os
import time
import pandas as pd
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter

# model hyperparameters
@dataclass
class ModelConfig:
    vocab_size: int = None  
    n_embed : int = None
    n_hidden: int = None

RNNcell

class RNNCell(nn.Module):
    """
    the job of a 'Cell' is to:
    take input at current time step x_{t} and the hidden state at the
    previous time step h_{t-1} and return the resulting hidden state
    h_{t} at the current timestep
    """
    def __init__(self, config):
        super().__init__()
        self.xh_to_h = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)

    def forward(self, xt, hprev):
        xh = torch.cat([xt, hprev], dim=1)
        ht = torch.tanh(self.xh_to_h(xh))
        return ht

GRUcell

class GRUCell(nn.Module):
    """
    same job as RNN cell, but a bit more complicated recurrence formula
    that makes the GRU more expressive and easier to optimize.
    """
    def __init__(self, config):
        super().__init__()
        # update gate (z), reset gate (r), candidate hidden state (hbar)
        self.xh_to_z = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)
        self.xh_to_r = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)
        self.xh_to_hbar = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)

    def forward(self, xt, hprev):
        # first use the reset gate to wipe some channels of the hidden state to zero
        xh = torch.cat([xt, hprev], dim=1)
        r = torch.sigmoid(self.xh_to_r(xh))
        hprev_reset = r * hprev
        # calculate the candidate new hidden state hbar
        xhr = torch.cat([xt, hprev_reset], dim=1)
        hbar = torch.tanh(self.xh_to_hbar(xhr))
        # calculate the switch gate that determines if each channel should be updated at all
        z = torch.sigmoid(self.xh_to_z(xh))
        # blend the previous hidden state and the new candidate hidden state
        ht = (1 - z) * hprev + z * hbar
        return ht

class RNN(nn.Module):

    def __init__(self, config, cell_type):
        super().__init__()
        self.vocab_size = config.vocab_size
        self.start = nn.Parameter(torch.zeros(1, config.n_hidden)) # the starting hidden state
        self.wte = nn.Embedding(config.vocab_size, config.n_embed) # token embeddings table
        if cell_type == 'rnn':
            self.cell = RNNCell(config)
        elif cell_type == 'gru':
            self.cell = GRUCell(config)
        else:
            raise ValueError(f"unknown cell_type: {cell_type!r}")
        self.lm_head = nn.Linear(config.n_hidden, self.vocab_size)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()

        # embed all the integers up front and all at once for efficiency
        emb = self.wte(idx) # (b, t, n_embed)

        # sequentially iterate over the inputs and update the RNN state each tick
        hprev = self.start.expand((b, -1)) # expand out the batch dimension
        hiddens = []
        for i in range(t):
            xt = emb[:, i, :] # (b, n_embed)
            ht = self.cell(xt, hprev) # (b, n_hidden)
            hprev = ht
            hiddens.append(ht)

        # decode the outputs
        hidden = torch.stack(hiddens, 1) # (b, t, n_hidden)
        logits = self.lm_head(hidden)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss

Test data

The dataset is a collection of 10k+ Chinese food-delivery (waimai) reviews:

data = pd.read_csv('./dataset/waimai_10k.csv')
data.dropna(subset=['review'], inplace=True)
data['review_length'] = data.review.apply(len)  # length in characters
data.sample(5)

	label	review	review_length
2062	1	价格实惠，值得购买。	10
2372	1	很好吃，奶茶还好，天很冷，还没有凉	17
6399	0	这么好的店、宫保鸡丁竟然拿土豆充当鸡肉！20多的菜，至于吗？	30
3147	1	挺好的，豆浆很好喝～～～	12
1248	1	好吃，真的是大肘子肉	10

Corpus statistics:

data = data[data.review_length <= 50] # drop reviews longer than 50 characters
words = data.review.tolist()
chars = sorted(list(set(''.join(words))))    
max_word_length = max(len(w) for w in words)

print(f"number of examples: {len(words)}")
print(f"max word length: {max_word_length}")
print(f"size of vocabulary: {len(chars)}")

number of examples: 10796
max word length: 50
size of vocabulary: 2272

Splitting into training/test data

test_set_size = min(1000, int(len(words) * 0.1)) 
rp = torch.randperm(len(words)).tolist()
train_words = [words[i] for i in rp[:-test_set_size]]
test_words = [words[i] for i in rp[-test_set_size:]]
print(f"split up the dataset into {len(train_words)} training examples and {len(test_words)} test examples")

split up the dataset into 9796 training examples and 1000 test examples

Building the character dataset [tensor]

<BLANK> : 0
token seqs : [1, 2, 3, 4, 5, 6]
x : [0, 1, 2, 3, 4, 5, 6]
y : [1, 2, 3, 4, 5, 6, 0]
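The layout above can be reproduced with plain Python lists. This is a sketch of the same indexing that CharDataset.__getitem__ performs on tensors; the function name `build_xy` is hypothetical, and `max_length` is set to 8 here only to make the padding visible.

```python
def build_xy(token_ids, max_length):
    n = len(token_ids)
    x = [0] * (max_length + 1)
    y = [0] * (max_length + 1)
    x[1:1 + n] = token_ids   # input: <BLANK> (0) followed by the tokens
    y[:n] = token_ids        # target: the tokens, shifted one step to the left
    # y[n] stays 0, so the model learns to emit <BLANK> as the end marker
    for i in range(n + 1, max_length + 1):
        y[i] = -1            # padding positions, ignored by the loss
    return x, y

x, y = build_xy([1, 2, 3, 4, 5, 6], max_length=8)
# x is [0, 1, 2, 3, 4, 5, 6, 0, 0]
# y is [1, 2, 3, 4, 5, 6, 0, -1, -1]
```

At every position `y[i]` is the token that should follow `x[i]`, and the `-1` entries match the `ignore_index=-1` argument passed to `F.cross_entropy` in the model's forward pass.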

class CharDataset(Dataset):

    def __init__(self, words, chars, max_word_length):
        self.words = words
        self.chars = chars
        self.max_word_length = max_word_length
        # char-->index-->char
        self.char2i = {ch:i+1 for i,ch in enumerate(chars)}
        self.i2char = {i:s for s,i in self.char2i.items()}    

    def __len__(self):
        return len(self.words)

    def contains(self, word):
        return word in self.words

    def get_vocab_size(self):
        return len(self.chars) + 1      

    def get_output_length(self):
        return self.max_word_length + 1

    def encode(self, word):
        # char sequence ---> index sequence
        ix = torch.tensor([self.char2i[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        # index sequence ---> char sequence
        word = ''.join(self.i2char[i] for i in ix)
        return word

    def __getitem__(self, idx):
        word = self.words[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix
        y[:len(ix)] = ix
        y[len(ix)+1:] = -1 # index -1 will mask the loss
        return x, y

Data loader [DataLoader]

class InfiniteDataLoader:
    
    def __init__(self, dataset, **kwargs):
        train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
        self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
        self.data_iter = iter(self.train_loader)

    def next(self):
        try:
            batch = next(self.data_iter)
        except StopIteration: # this will technically only happen after 1e10 samples... (i.e. basically never)
            self.data_iter = iter(self.train_loader)
            batch = next(self.data_iter)
        return batch

Training the model

# model evaluation
@torch.inference_mode()
def evaluate(model, dataset, batch_size=10, max_batches=None):
    model.eval()
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
    losses = []
    for i, batch in enumerate(loader):
        batch = [t.to('cuda') for t in batch]
        X, Y = batch
        logits, loss = model(X, Y)
        losses.append(loss.item())
        if max_batches is not None and i >= max_batches:
            break
    mean_loss = torch.tensor(losses).mean().item()
    model.train() # reset model back to training mode
    return mean_loss

Environment setup:

torch.manual_seed(seed=12345)
torch.cuda.manual_seed_all(seed=12345)

work_dir = "./Rnn_log"
os.makedirs(work_dir, exist_ok=True)
writer = SummaryWriter(log_dir=work_dir)

Model initialization:

config = ModelConfig(vocab_size=len(chars)+1,
                     n_embed=64,
                     n_hidden=128)

#model = RNN(config,cell_type='rnn')
model = RNN(config,cell_type='gru')

model.to('cuda')

RNN(
  (wte): Embedding(2273, 64)
  (cell): GRUCell(
    (xh_to_z): Linear(in_features=192, out_features=128, bias=True)
    (xh_to_r): Linear(in_features=192, out_features=128, bias=True)
    (xh_to_hbar): Linear(in_features=192, out_features=128, bias=True)
  )
  (lm_head): Linear(in_features=128, out_features=2273, bias=True)
)

Dataset initialization:

train_dataset = CharDataset(train_words, chars, max_word_length)
test_dataset = CharDataset(test_words, chars, max_word_length)

train_dataset[0][0].shape, train_dataset[0][1].shape

(torch.Size([51]), torch.Size([51]))

Training:

# init optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.99), eps=1e-8)
# init dataloader
batch_loader = InfiniteDataLoader(train_dataset, batch_size=128, pin_memory=True, num_workers=4)

# training loop
best_loss = None
step = 0
train_losses, test_losses = [],[]
while True:

    t0 = time.time()

    # get the next batch, ship to device, and unpack it to input and target
    batch = batch_loader.next()
    batch = [t.to('cuda') for t in batch]
    X, Y = batch
    # feed into the model
    logits, loss = model(X, Y)

    # calculate the gradient, update the weights
    model.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # wait for all CUDA work on the GPU to finish then calculate iteration time taken
    torch.cuda.synchronize()
    t1 = time.time()

    # logging
    if step % 1000 == 0:
        print(f"step {step} | loss {loss.item():.4f} | step time {(t1-t0)*1000:.2f}ms")

    # evaluate the model
    if step > 0 and step % 100 == 0:
        train_loss = evaluate(model, train_dataset, batch_size=100, max_batches=10)
        test_loss  = evaluate(model, test_dataset,  batch_size=100, max_batches=10)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        # save the model to disk if it has improved
        if best_loss is None or test_loss < best_loss:
            out_path = os.path.join(work_dir, "model.pt")
            print(f"test loss {test_loss} is the best so far, saving model to {out_path}")
            torch.save(model.state_dict(), out_path)
            best_loss = test_loss

    step += 1
    # termination conditions
    if step > 10100:
        break

step 0 | loss 7.7387 | step time 84.71ms
test loss 5.455846786499023 is the best so far, saving model to ./Rnn_log/model.pt
test loss 5.085928916931152 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.722366809844971 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.451460361480713 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.261294364929199 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.121057987213135 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.0212507247924805 is the best so far, saving model to ./Rnn_log/model.pt
test loss 3.935884475708008 is the best so far, saving model to ./Rnn_log/model.pt
test loss 3.87166166305542 is the best so far, saving model to ./Rnn_log/model.pt
step 1000 | loss 3.7037 | step time 66.99ms
.......
test loss 3.476886749267578 is the best so far, saving model to ./Rnn_log/model.pt
step 4000 | loss 2.9470 | step time 57.79ms
step 5000 | loss 2.8236 | step time 60.15ms
step 6000 | loss 2.7413 | step time 60.07ms
step 7000 | loss 2.6398 | step time 58.10ms
step 8000 | loss 2.5385 | step time 58.41ms
step 9000 | loss 2.3928 | step time 58.49ms
step 10000 | loss 2.2889 | step time 57.82ms

RNNLM vs GRULM

[Figure: RNNLM vs GRULM comparison; image not available]

Testing: a review generator

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    for _ in range(max_new_tokens):
        # forward the model to get the logits for the index in the sequence
        logits, _ = model(idx)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:,-1,:] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # either sample from the distribution or take the most likely element
        if do_sample:
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            _, idx_next = torch.topk(probs, k=1, dim=-1)
         
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=-1)
    return idx
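To make the effect of `temperature` and `top_k` in `generate` concrete, here is a pure-Python sketch of the same transformation applied to a single list of logits. The function name `next_token_distribution` and the logit values are made up for illustration.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None):
    # scale the logits: temperature < 1 sharpens, temperature > 1 flattens
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        # keep the k largest logits and mask everything below the k-th one
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [z if z >= kth else float("-inf") for z in scaled]
    # softmax; masked entries contribute exp(-inf) = 0
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
plain = next_token_distribution(logits)
sharp = next_token_distribution(logits, temperature=0.5)
top2 = next_token_distribution(logits, top_k=2)
```

With `top_k=2` only the two most likely tokens keep non-zero probability, and with `temperature=0.5` the most likely token takes a larger share than it does at the default temperature.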

def print_samples(num=13):
    # initial 0 tokens
    X_init = torch.zeros((num, 1), dtype=torch.long).to('cuda')
    steps = train_dataset.get_output_length() - 1 # -1 because we already start with <START> token (index 0)
    X_samp = generate(model, X_init, steps, top_k=None, do_sample=True).to('cuda')
    new_samples = []
    for i in range(X_samp.size(0)):
        # get the i'th row of sampled integers, as python list
        row = X_samp[i, 1:].tolist() # note: we need to crop out the first <START> token
        # token 0 is the <END> token, so we crop the output sequence at that point
        crop_index = row.index(0) if 0 in row else len(row)
        row = row[:crop_index]
        word_samp = train_dataset.decode(row)
        new_samples.append(word_samp)
    return new_samples

print_samples(num=10)

['不好吃，肥肉煎饼！不松心了！',
 '山药有两次，不过小蛋鱼还不错，肉里的不筋道少',
 '草面不值的煎饼',
 '菜给的不错，服务好，速度快，来了！绝对的是辣',
 '速度很快，不贴心吧',
 '好吃，就是量少,味道不怎么样啊',
 '菜品很喜欢，百度骑士特别棒！',
 '金针菇汉堡我喜欢,面好大馅鲜,很好吃,毕竟糊所有菜品不如以前的菜饼。',
 '蛮生的。送过小哥快被味道真差。',
 '巨好吃~！！']