实例主要用于熟悉相关模型，并且练习创建一个模型的步骤：数据收集、数据预处理、构建模型、训练模型、测试模型、观察模型表现、保存模型

传送门：蓝桥云课实验

1. 实验环境
2. 实验目的
3. 相关原理
4. 实验步骤
- 4.1 数据收集
- - - 从在线商城抓取评论数据
- 4.2 数据预处理（洗数据）
- 4.3 构建模型（基于词袋模型的简单文本分类器）
- - 4.3.1 数据预处理
  - 4.3.2 构建模型
  - 4.3.3 训练模型
  - 4.3.4 绘制图像观察模型表现
  - 4.3.5 保存模型
- 4.4 构建模型（基于 RNN 的简单文本分类器）
- - 4.4.1 数据预处理
  - 4.4.2 构建模型
  - 4.4.3 训练模型
  - 4.4.4 测试模型
  - 4.4.5 保存模型
- 4.5 构建模型（基于 LSTM 的简单文本分类器）
- - 4.5.0 数据预处理（略）
  - 4.5.1 构建模型
  - 4.5.2 训练模型
  - 4.5.3 测试模型
  - 4.5.4 保存模型

1. 实验环境

Jupyter Notebook
Python 3.7
PyTorch 1.4.0

2. 实验目的

编写一个自动爬虫程序，并使用它从在线商城的大量商品评论中抓取评论文本以及分类标签（评论得分）。
将根据文本的词袋（Bag of Word）模型来对文本进行建模，然后利用一个神经网络来对这段文本进行分类。识别一段文字中的情绪，从而判断出这句话是称赞还是抱怨。

3. 相关原理

使用 Python 从网络上爬取信息的基本方法
处理语料“洗数据”的基本方法
词袋模型搭建方法
简单神经网络模型RNN 的搭建方法
简单长短时记忆网络LSTM 的搭建方法：
LSTM是RNN的一种，可以解决RNN短时记忆的不足。参考博客：LSTM（长短时记忆网络）

4. 实验步骤

# 下载本实验所需数据并解压

!wget https://labfile.oss.aliyuncs.com/courses/1073/data3.zip
!unzip data3.zip

#下载分词工具 jieba，用于在评论文本的句子中进行中文分词。
!pip install jieba

4.1 数据收集

# 导入程序所需要的程序包

#抓取网页内容用的程序包
import json
import requests

#PyTorch用的包
import torch
import torch.nn as nn
import torch.optim
from torch.autograd import Variable

# 自然语言处理相关的包
import re #正则表达式的包
import jieba #结巴分词包
from collections import Counter #搜集器，可以让统计词频更简单

#绘图、计算用的程序包
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

从在线商城抓取评论数据

（很多网站设置了反爬虫，这一步可以不做，后面会给数据文件）
productId 为商品的 id，score 为评分，page 为对应的评论翻页的页码，pageSize 为总页数。
这里，我们设定只获得3星的商品评价，即 score＝3 表示好的评分。
注意在下面每条 URL 中，都预留了 page={}，用于后面的程序指定页号。
参考的代码：

# 在指定的url处获得评论
def get_comments(url):
    comments = []
    # 打开指定页面
    resp = requests.get(url)
    resp.encoding = 'gbk'

    #如果200秒没有打开则失败
    if resp.status_code != 200:
        return []

    #获得内容
    content = resp.text
    if content:
        #获得（）括号中的内容
        ind = content.find('(')
        s1 = content[ind+1:-2]
        try:
            #尝试利用jason接口来读取内容，并做jason的解析
            js = json.loads(s1)
            #提取出comments字段的内容
            comment_infos = js['comments']
        except:
            print('error')
            return([])

        #对每一条评论进行内容部分的抽取
        for comment_info in comment_infos:
            comment_content = comment_info['content']
            str1 = comment_content + '\n'
            comments.append(str1)
    return comments

good_comments = []

# 对上述网址进行循环，并模拟翻页100次
j=0
for good_comment_url_template in good_comment_url_templates:
    for i in range(100):
        # 使用 format 指定 URL 中的页号，0~99
        url = good_comment_url_template.format(i)
        good_comments += get_comments(url)
        print('第{}条纪录，总文本长度{}'.format(j, len(good_comments)))
        j += 1
        
# 将结果存储到good.txt文件中
fw = open('data/good.txt', 'w')
fw.writelines(good_comments)
fw.close()
print('Finished')

在这里插入图片描述
再爬一颗星的负面评价

# 负向评论如法炮制
bad_comments = []

j = 0
for bad_comment_url_template in bad_comment_url_templates:
    for i in range(100):
        url = bad_comment_url_template.format(i)
        bad_comments += get_comments(url)
        print('第{}条纪录，总文本长度{}'.format(j, len(bad_comments)))
        j += 1

fw = open('data/bad.txt', 'w')
fw.writelines(bad_comments)
fw.close()
print('Finished')

在这里插入图片描述

4.2 数据预处理（洗数据）

根据数据收集步骤，爬取好的评论文本在 data/good.txt 以及 data/bad.txt 中。

# 过滤掉其中的标点符号等无意义的字符。这样可以使词袋模型预测的更加准确。
def filter_punc(sentence):
    sentence = re.sub("[\s+\.\!\/_,$%^*(+\"\'“”《》?“]+|[+——！，。？、~@#￥%……&*（）：]+", "", sentence)  
    return(sentence)
   
# 扫描所有的文本，分词、建立词典，分出正向还是负向的评论，is_filter可以过滤是否筛选掉标点符号
def Prepare_data(good_file, bad_file, is_filter = True):
    all_words = [] #存储所有的单词
    #将分词好的评论分别放置在 pos_sentences 与 neg_sentences 中
    pos_sentences = [] #存储正向的评论
    neg_sentences = [] #存储负向的评论
    with open(good_file, 'r') as fr:
        for idx, line in enumerate(fr):
            if is_filter:
                #过滤标点符号
                line = filter_punc(line)    #过滤无意义字符
            #分词
            words = jieba.lcut(line)
            if len(words) > 0:
                all_words += words
                pos_sentences.append(words)
    print('{0} 包含 {1} 行, {2} 个词.'.format(good_file, idx+1, len(all_words)))

    count = len(all_words)
    with open(bad_file, 'r') as fr:
        for idx, line in enumerate(fr):
            if is_filter:
                line = filter_punc(line)
            words = jieba.lcut(line)
            if len(words) > 0:
                all_words += words
                neg_sentences.append(words)
    print('{0} 包含 {1} 行, {2} 个词.'.format(bad_file, idx+1, len(all_words)-count))

    #建立词典，diction的每一项为{w:[id, 单词出现次数]}   建立好了词典，词典中包含评论中出现的每一种单词。
    diction = {}      #字典
    cnt = Counter(all_words)
    for word, freq in cnt.items():
        diction[word] = [len(diction), freq]
    print('字典大小：{}'.format(len(diction)))
    return(pos_sentences, neg_sentences, diction)
    
#根据单词返还单词的编码   用“词”来查“索引号”
def word2index(word, diction):
    if word in diction:
        value = diction[word][0]
    else:
        value = -1
    return(value)

#根据编码获得单词     用“索引号”来查“词”
def index2word(index, diction):
    for w,v in diction.items():
        if v[0] == index:
            return(w)
    return(None)

#主要代码：读取包含商品评论信息的两个文本文件。
good_file = 'data/good.txt'
bad_file  = 'data/bad.txt'

pos_sentences, neg_sentences, diction = Prepare_data(good_file, bad_file, True)
st = sorted([(v[1], w) for w, v in diction.items()])
st

在这里插入图片描述

4.3 构建模型（基于词袋模型的简单文本分类器）

词袋模型：就是一种简单而有效的对文本进行向量化表示的方法。就是将一句话中的所有单词都放进一个袋子里（单词表），而忽略掉语法、语义，甚至单词之间的顺序这些所有的信息。

4.3.1 数据预处理

编写的函数，以句子为单位，将所有的积极情感的评论文本，全部转化为句子向量，并保存到数据集 dataset 中。同样对于消极情绪的评论如法炮制，全部转化为句子向量并保存到数据集 dataset 中。

#句子向量的尺寸为词典中词汇的个数，句子向量在第i位置上面的数值为第i个单词在 sentence 中出现的频率。
def sentence2vec(sentence, dictionary):
    vector = np.zeros(len(dictionary))
    for l in sentence:
        vector[l] += 1
    return(1.0 * vector / len(sentence))

# 遍历所有句子，将每一个词映射成编码
dataset = [] #数据集
labels = [] #标签
sentences = [] #原始句子，调试用

# 处理正向评论
for sentence in pos_sentences:
    new_sentence = []
    for l in sentence:
        if l in diction:
            new_sentence.append(word2index(l, diction))
    dataset.append(sentence2vec(new_sentence, diction))
    labels.append(0) #正标签为0
    sentences.append(sentence)

# 处理负向评论
for sentence in neg_sentences:
    new_sentence = []
    for l in sentence:
        if l in diction:
            new_sentence.append(word2index(l, diction))
    dataset.append(sentence2vec(new_sentence, diction))
    labels.append(1) #负标签为1
    sentences.append(sentence)

在这里需要对句子进行切分，分为：训练集、测试集、校验集（比例大概10：1：1）
校验集：主要是为了校验模型是否会产生过拟合的现象。
过程：模型训练好后，利用校验集数据检测模型表现，如果误差和训练数据差不多，则说明模型泛化能力很强，否则就是模型出现了过拟合的现象。

参考代码：

#打乱所有的数据顺序，形成数据集
# indices为所有数据下标的一个全排列
indices = np.random.permutation(len(dataset))

#重新根据打乱的下标生成数据集dataset，标签集labels，以及对应的原始句子sentences
dataset = [dataset[i] for i in indices]
labels = [labels[i] for i in indices]
sentences = [sentences[i] for i in indices]

#对整个数据集进行划分，分为：训练集、校准集和测试集，其中校准和测试集合的长度都是整个数据集的10分之一
test_size = len(dataset) // 10
train_data = dataset[2 * test_size :]
train_label = labels[2 * test_size :]

valid_data = dataset[: test_size]
valid_label = labels[: test_size]

test_data = dataset[test_size : 2 * test_size]
test_label = labels[test_size : 2 * test_size]

4.3.2 构建模型

使用 PyTorch 自带的 Sequential 命令来建立多层前馈网络。其中，Sequential 中的每一个部件都是 PyTorch 的 nn.Module 模块继承而来的对象。
在这里插入图片描述

# 一个简单的前馈神经网络，三层，第一层线性层，加一个非线性ReLU，第二层线性层，中间有10个隐含层神经元

# 输入维度为词典的大小：每一段评论的词袋模型
model = nn.Sequential(
    nn.Linear(len(diction), 10),    #线性       包含10个隐含层神经元
    nn.ReLU(),                      #非线性
    nn.Linear(10, 2),               #线性
    nn.LogSoftmax(dim=1),
)

#计算预测错误率的函数
def rightness(predictions, labels):
    """其中predictions是模型给出的一组预测结果，batch_size行num_classes列的矩阵，labels是数据之中的正确答案"""
    pred = torch.max(predictions.data, 1)[1] # 对于任意一行（一个样本）的输出值的第1个维度，求最大，得到每一行的最大元素的下标
    rights = pred.eq(labels.data.view_as(pred)).sum() #将下标与labels中包含的类别进行比较，并累计得到比较正确的数量
    return rights, len(labels) #返回正确的数量和这一次一共比较了多少元素

4.3.3 训练模型

# 损失函数为交叉熵
cost = torch.nn.NLLLoss()
# 优化算法为Adam，可以自动调节学习率
optimizer = torch.optim.Adam(model.parameters(), lr = 0.01)

# 记录列表，记录训练时的各种数据，以用于绘图
records = []

# loss 列表，用于记录训练中的 loss
losses = []

#开始训练循环
def trainModel(data, label):
    # 需要将输入的数据进行适当的变形，主要是要多出一个batch_size的维度，也即第一个为1的维度
    # 这样做是为了适应 PyTorch 函数的特殊用法，具体可以参考 PyTorch 官方文档
    x = Variable(torch.FloatTensor(data).view(1,-1))
    # x的尺寸：batch_size=1, len_dictionary
    # 标签也要加一层外衣以变成1*1的张量
    y = Variable(torch.LongTensor(np.array([label])))
    # y的尺寸：batch_size=1, 1

    # 清空梯度
    optimizer.zero_grad()
    # 模型预测
    predict = model(x)
    # 计算损失函数
    loss = cost(predict, y)
    # 将损失函数数值加入到列表中
    losses.append(loss.data.numpy())
    # 开始进行梯度反传
    loss.backward()
    # 开始对参数进行一步优化
    optimizer.step()

#函数会返回模型预测的结果、正确率、损失值
def evaluateModel(data, label):
    x = Variable(torch.FloatTensor(data).view(1, -1))
    y = Variable(torch.LongTensor(np.array([label])))
    # 模型预测
    predict = model(x)
    # 调用rightness函数计算准确度
    right = rightness(predict, y)
    # 计算loss
    loss = cost(predict, y)

    return predict, right, loss

#训练循环部分  循环10个Epoch
for epoch in range(10):
    for i, data in enumerate(zip(train_data, train_label)):
        x, y = data
        # 调用上面编写的训练函数
        # x 即句子向量，y 即标签（0 or 1）
        trainModel(x, y)

        # 每隔3000步，跑一下校验数据集的数据，输出临时结果
        if i % 3000 == 0:
            val_losses = []
            rights = []
            # 在所有校验数据集上实验
            for j, val in enumerate(zip(valid_data, valid_label)):
                x, y = val
                # 调用模型测试函数
                predict, right, loss = evaluateModel(x, y)
                rights.append(right)
                val_losses.append(loss.data.numpy())

            # 将校验集合上面的平均准确度计算出来
            right_ratio = 1.0 * np.sum([i[0] for i in rights]) / np.sum([i[1] for i in rights])
            print('第{}轮，训练损失：{:.2f}, 校验损失：{:.2f}, 校验准确率: {:.2f}'.format(epoch, np.mean(losses),
                                                                        np.mean(val_losses), right_ratio))
            records.append([np.mean(losses), np.mean(val_losses), right_ratio])

在这里插入图片描述

4.3.4 绘制图像观察模型表现

# 绘制误差曲线
a = [i[0] for i in records]
b = [i[1] for i in records]
c = [i[2] for i in records]
plt.plot(a, label = 'Train Loss')
plt.plot(b, label = 'Valid Loss')
plt.plot(c, label = 'Valid Accuracy')
plt.xlabel('Steps')
plt.ylabel('Loss & Accuracy')
plt.legend()

在这里插入图片描述

将模型在整个测试集上运行，记录预测结果，并计算总的正确率。

vals = [] #记录准确率所用列表

#对测试数据集进行循环
for data, target in zip(test_data, test_label):
    data, target = Variable(torch.FloatTensor(data).view(1,-1)), Variable(torch.LongTensor(np.array([target])))
    output = model(data) #将特征数据喂入网络，得到分类的输出
    val = rightness(output, target) #获得正确样本数以及总样本数
    vals.append(val) #记录结果

#计算准确率
rights = (sum([tup[0] for tup in vals]), sum([tup[1] for tup in vals]))
right_rate = 1.0 * rights[0].data.numpy() / rights[1]
print(right_rate)

计算输出的分类准确度可以达到 90% 左右。

4.3.5 保存模型

# 保存、提取模型（为展示用）
torch.save(model,'bow.mdl')
model = torch.load('bow.mdl')

4.4 构建模型（基于 RNN 的简单文本分类器）

4.4.1 数据预处理

good_file = 'data/good.txt'
bad_file  = 'data/bad.txt'

pos_sentences, neg_sentences, diction = Prepare_data(good_file, bad_file, False)

dataset = []
labels = []
sentences = []

#将评论语料向量化，变成句子向量
# 正例集合
for sentence in pos_sentences:
    new_sentence = []
    for l in sentence:
        if l in diction:
            # 注意将每个词编码
            new_sentence.append(word2index(l, diction))
    #每一个句子都是一个不等长的整数序列
    dataset.append(new_sentence)
    labels.append(0)
    sentences.append(sentence)

# 反例集合
for sentence in neg_sentences:
    new_sentence = []
    for l in sentence:
        if l in diction:
            new_sentence.append(word2index(l, diction))
    dataset.append(new_sentence)
    labels.append(1)
    sentences.append(sentence)

# 重新对数据洗牌，构造数据集合
indices = np.random.permutation(len(dataset))
dataset = [dataset[i] for i in indices]
labels = [labels[i] for i in indices]
sentences = [sentences[i] for i in indices]

test_size = len(dataset) // 10

# 训练集
train_data = dataset[2 * test_size :]
train_label = labels[2 * test_size :]

# 校验集
valid_data = dataset[: test_size]
valid_label = labels[: test_size]

# 测试集
test_data = dataset[test_size : 2 * test_size]
test_label = labels[test_size : 2 * test_size]

4.4.2 构建模型

在这里插入图片描述

# 一个手动实现的RNN模型
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size
        # 一个embedding层
        self.embed = nn.Embedding(input_size, hidden_size)
        # 隐含层内部的相互链接
        self.i2h = nn.Linear(2 * hidden_size, hidden_size)
        # 隐含层到输出层的链接
        self.i2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):

        # 先进行embedding层的计算，它可以把一个数或者数列，映射成一个向量或一组向量
        # input尺寸：seq_length, 1
        x = self.embed(input)
        # x尺寸：hidden_size

        # 将输入和隐含层的输出（hidden）耦合在一起构成了后续的输入
        combined = torch.cat((x.view(1, -1), hidden), 1)
        # combined尺寸：2*hidden_size
        #
        # 从输入到隐含层的计算
        hidden = self.i2h(combined)
        # combined尺寸：hidden_size

        # 从隐含层到输出层的运算
        output = self.i2o(hidden)
        # output尺寸：output_size

        # softmax函数
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        # 对隐含单元的初始化
        # 注意尺寸是：batch_size, hidden_size
        return Variable(torch.zeros(1, self.hidden_size))

4.4.3 训练模型

大概预计在30分钟左右

# 开始训练这个RNN，10个隐含层单元
rnn = RNN(len(diction), 10, 2)

# 交叉熵评价函数
cost = torch.nn.NLLLoss()

# Adam优化器
optimizer = torch.optim.Adam(rnn.parameters(), lr = 0.001)
records = []

# 学习周期10次
losses = []
for epoch in range(10):

    for i, data in enumerate(zip(train_data, train_label)):
        x, y = data
        x = Variable(torch.LongTensor(x))
        #x尺寸：seq_length（序列的长度）
        y = Variable(torch.LongTensor([y]))
        #x尺寸：batch_size = 1,1
        optimizer.zero_grad()

        #初始化隐含层单元全为0
        hidden = rnn.initHidden()
        # hidden尺寸：batch_size = 1, hidden_size

        #手动实现RNN的时间步循环，x的长度就是总的循环时间步，因为要把x中的输入句子全部读取完毕
        for s in range(x.size()[0]):
            output, hidden = rnn(x[s], hidden)

        #校验函数
        loss = cost(output, y)
        losses.append(loss.data.numpy())
        loss.backward()
        # 开始优化
        optimizer.step()
        if i % 3000 == 0:
            # 每间隔3000步来一次校验集上面的计算
            val_losses = []
            rights = []
            for j, val in enumerate(zip(valid_data, valid_label)):
                x, y = val
                x = Variable(torch.LongTensor(x))
                y = Variable(torch.LongTensor(np.array([y])))
                hidden = rnn.initHidden()
                for s in range(x.size()[0]):
                    output, hidden = rnn(x[s], hidden)
                right = rightness(output, y)
                rights.append(right)
                loss = cost(output, y)
                val_losses.append(loss.data.numpy())
            # 计算准确度
            right_ratio = 1.0 * np.sum([i[0] for i in rights]) / np.sum([i[1] for i in rights])
            print('第{}轮，训练损失：{:.2f}, 测试损失：{:.2f}, 测试准确率: {:.2f}'.format(epoch, np.mean(losses),
                                                                        np.mean(val_losses), right_ratio))
            records.append([np.mean(losses), np.mean(val_losses), right_ratio])


# 绘制误差曲线
a = [i[0] for i in records]
b = [i[1] for i in records]
c = [i[2] for i in records]
plt.plot(a, label = 'Train Loss')
plt.plot(b, label = 'Valid Loss')
plt.plot(c, label = 'Valid Accuracy')
plt.xlabel('Steps')
plt.ylabel('Loss & Accuracy')
plt.legend()

4.4.4 测试模型

vals = [] #记录准确率所用列表
rights = list(rights)
#对测试数据集进行循环
for j, test in enumerate(zip(test_data, test_label)):
    x, y = test
    x = Variable(torch.LongTensor(x))
    y = Variable(torch.LongTensor(np.array([y])))
    hidden = rnn.initHidden()
    for s in range(x.size()[0]):
        output, hidden = rnn(x[s], hidden)
    right = rightness(output, y)
    rights.append(right)
    val = rightness(output, y) #获得正确样本数以及总样本数
    vals.append(val) #记录结果

#计算准确率
rights = (sum([tup[0] for tup in vals]), sum([tup[1] for tup in vals]))
right_rate = 1.0 * rights[0].data.numpy() / rights[1]
right_rate

4.4.5 保存模型

# 保存、加载模型（为讲解用）
torch.save(rnn, 'rnn.mdl')
rnn = torch.load('rnn.mdl')

4.5 构建模型（基于 LSTM 的简单文本分类器）

LSTM 与 RNN 最大的区别就是在于每个神经元中多增加了3个控制门：遗忘门、输入门和输出门。
另外，在每个隐含层神经元中，LSTM 多了一个 cell 的状态，起到了记忆的作用。
这就使得 LSTM 可以记忆更长时间的 Pattern。

4.5.0 数据预处理（略）

其他步骤与RNN模型相同，省略

4.5.1 构建模型

class LSTMNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(LSTMNetwork, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        # LSTM的构造如下：一个embedding层，将输入的任意一个单词映射为一个向量
        # 一个LSTM隐含层，共有hidden_size个LSTM神经元
        # 一个全链接层，外接一个softmax输出
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers)
        self.fc = nn.Linear(hidden_size, 2)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden=None):

        #input尺寸: seq_length
        #词向量嵌入
        embedded = self.embedding(input)
        #embedded尺寸: seq_length, hidden_size

        #PyTorch设计的LSTM层有一个特别别扭的地方是，输入张量的第一个维度需要是时间步，
        #第二个维度才是batch_size，所以需要对embedded变形
        embedded = embedded.view(input.data.size()[0], 1, self.hidden_size)
        #embedded尺寸: seq_length, batch_size = 1, hidden_size

        #调用PyTorch自带的LSTM层函数，注意有两个输入，一个是输入层的输入，另一个是隐含层自身的输入
        # 输出output是所有步的隐含神经元的输出结果，hidden是隐含层在最后一个时间步的状态。
        # 注意hidden是一个tuple，包含了最后时间步的隐含层神经元的输出，以及每一个隐含层神经元的cell的状态

        output, hidden = self.lstm(embedded, hidden)
        #output尺寸: seq_length, batch_size = 1, hidden_size
        #hidden尺寸: 二元组(n_layer = 1 * batch_size = 1 * hidden_size, n_layer = 1 * batch_size = 1 * hidden_size)

        #我们要把最后一个时间步的隐含神经元输出结果拿出来，送给全连接层
        output = output[-1,...]
        #output尺寸: batch_size = 1, hidden_size

        #全链接层
        out = self.fc(output)
        #out尺寸: batch_size = 1, output_size
        # softmax
        out = self.logsoftmax(out)
        return out

    def initHidden(self):
        # 对隐单元的初始化

        # 对隐单元输出的初始化，全0.
        # 注意hidden和cell的维度都是layers,batch_size,hidden_size
        hidden = Variable(torch.zeros(self.n_layers, 1, self.hidden_size))
        # 对隐单元内部的状态cell的初始化，全0
        cell = Variable(torch.zeros(self.n_layers, 1, self.hidden_size))
        return (hidden, cell)

4.5.2 训练模型

# 开始训练LSTM网络

# 构造一个LSTM网络的实例
lstm = LSTMNetwork(len(diction), 10, 2)

#定义损失函数
cost = torch.nn.NLLLoss()

#定义优化器
optimizer = torch.optim.Adam(lstm.parameters(), lr = 0.001)
records = []

# 开始训练，一共15个epoch
losses = []
for epoch in range(15):
    for i, data in enumerate(zip(train_data, train_label)):
        x, y = data
        x = Variable(torch.LongTensor(x))
        #x尺寸：seq_length，序列的长度
        y = Variable(torch.LongTensor([y]))
        #y尺寸：batch_size = 1, 1
        optimizer.zero_grad()

        #初始化LSTM隐含层单元的状态
        hidden = lstm.initHidden()
        #hidden: 二元组(n_layer = 1 * batch_size = 1 * hidden_size, n_layer = 1 * batch_size = 1 * hidden_size)

        #让LSTM开始做运算，注意，不需要手工编写对时间步的循环，而是直接交给PyTorch的LSTM层。
        #它自动会根据数据的维度计算若干时间步
        output = lstm(x, hidden)
        #output尺寸: batch_size = 1, output_size

        #损失函数
        loss = cost(output, y)
        losses.append(loss.data.numpy())

        #反向传播
        loss.backward()
        optimizer.step()

        #每隔3000步，跑一次校验集，并打印结果
        if i % 3000 == 0:
            val_losses = []
            rights = []
            for j, val in enumerate(zip(valid_data, valid_label)):
                x, y = val
                x = Variable(torch.LongTensor(x))
                y = Variable(torch.LongTensor(np.array([y])))
                hidden = lstm.initHidden()
                output = lstm(x, hidden)
                #计算校验数据集上的分类准确度
                right = rightness(output, y)
                rights.append(right)
                loss = cost(output, y)
                val_losses.append(loss.data.numpy())
            right_ratio = 1.0 * np.sum([i[0] for i in rights]) / np.sum([i[1] for i in rights])
            print('第{}轮，训练损失：{:.2f}, 测试损失：{:.2f}, 测试准确率: {:.2f}'.format(epoch, np.mean(losses),
                                                                        np.mean(val_losses), right_ratio))
            records.append([np.mean(losses), np.mean(val_losses), right_ratio])         


# 绘制误差曲线
a = [i[0] for i in records]
b = [i[1] for i in records]
c = [i[2] for i in records]
plt.plot(a, label = 'Train Loss')
plt.plot(b, label = 'Valid Loss')
plt.plot(c, label = 'Valid Accuracy')
plt.xlabel('Steps')
plt.ylabel('Loss & Accuracy')
plt.legend()

4.5.3 测试模型

vals = [] #记录准确率所用列表
rights = list(rights)
#对测试数据集进行循环
for j, test in enumerate(zip(test_data, test_label)):
    x, y = test
    x = Variable(torch.LongTensor(x))
    y = Variable(torch.LongTensor(np.array([y])))
    hidden = lstm.initHidden()
    output = lstm(x, hidden)
    right = rightness(output, y)
    rights.append(right)
    val = rightness(output, y) #获得正确样本数以及总样本数
    vals.append(val) #记录结果

#计算准确率
rights = (sum([tup[0] for tup in vals]), sum([tup[1] for tup in vals]))
right_rate = 1.0 * rights[0].data.numpy() / rights[1]
right_rate

4.5.4 保存模型

#保存、加载模型（为讲解用）
torch.save(lstm, 'lstm.mdl')
lstm = torch.load('lstm.mdl')

PyTorch实例2——文本情绪分类器

目录

1. 实验环境

2. 实验目的

3. 相关原理

4. 实验步骤

4.1 数据收集

从在线商城抓取评论数据

4.2 数据预处理（洗数据）

4.3 构建模型（基于词袋模型的简单文本分类器）

4.3.1 数据预处理

4.3.2 构建模型

4.3.3 训练模型

4.3.4 绘制图像观察模型表现

4.3.5 保存模型

4.4 构建模型（基于 RNN 的简单文本分类器）

4.4.1 数据预处理

4.4.2 构建模型

4.4.3 训练模型

4.4.4 测试模型

4.4.5 保存模型

4.5 构建模型（基于 LSTM 的简单文本分类器）

4.5.0 数据预处理（略）

4.5.1 构建模型

4.5.2 训练模型

4.5.3 测试模型

4.5.4 保存模型

相关文章