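Deep learning for text and sequences

Notes:

These study notes cover Chapter 6, Section 1 (Working with text data) of Deep Learning with Python.

Key points:

The two basic deep-learning approaches to text or sequence data are recurrent neural networks and 1D convolutional neural networks (1D convnets);
Applications of these algorithms include document classification, time-series classification, time-series comparison, time-series forecasting, sequence-to-sequence learning, sentiment analysis, and so on;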

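Working with text data:

Text can be understood as a sequence of characters or a sequence of words;
Deep-learning models cannot take raw text as input; they only work with numeric tensors, so text must first be vectorized;
Ways to vectorize text:
1). Split the text into words and transform each word into a vector;
2). Split the text into characters and transform each character into a vector;
3). Extract n-grams of words or characters (an n-gram is a group of several consecutive words or characters) and transform each n-gram into a vector;
Vectorizing text means applying some tokenization scheme to break the text into tokens (words, characters, or n-grams) and then associating a numeric vector with each token;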
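
As a minimal sketch of option 3, the helper below collects every word n-gram up to a given size from one sentence. The function name `extract_ngrams`, the lowercasing, and the whitespace tokenization are illustrative assumptions, not something taken from the book.

```python
def extract_ngrams(text, max_n=2):
    """Return the set of word n-grams of size 1..max_n found in `text`."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    ngrams = set()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            ngrams.add(" ".join(words[start:start + n]))
    return ngrams


bag_of_2grams = extract_ngrams("The cat sat on the mat", max_n=2)
print(sorted(bag_of_2grams))
```

Because the result is a set, word order across n-grams is lost, which is why this kind of representation is usually called a bag of n-grams.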

This section introduces only two ways of associating a vector with a token: one-hot encoding and word embedding

1. Word- and character-level `one-hot encoding`:

Word-level one-hot encoding:

import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

## Build the token dictionary, stored as {word: index + 1}
token_list = []
for sentence in samples:
    for word in sentence.split():
        token_list.append(word)
token_list = sorted(set(token_list))
## Indices start at 1; index 0 is deliberately left unassigned (by convention it is often reserved for padding)
token_index = {t: i + 1 for i, t in enumerate(token_list)}
print(token_index)

## Tokenize the samples, keeping only the first 10 words of each
max_length = 10
results = np.zeros(shape=(len(samples), max_length, max(token_index.values())+1))
for i, sentence in enumerate(samples):
    for j, word in list(enumerate(sentence.split()))[:max_length]: ## only the first 10 words
        index = token_index[word]
        results[i, j, index] = 1. ## i: sample; j: word position within sample i; index: that word's index, whose slot is set to 1

print(results.shape)
print(results)

{'The': 1, 'ate': 2, 'cat': 3, 'dog': 4, 'homework.': 5, 'mat.': 6, 'my': 7, 'on': 8, 'sat': 9, 'the': 10}
(2, 10, 11)
[[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]

1.1 Character-level `one-hot encoding`:

import string

samples = ["The cat sat on the mat.", "The dog ate my homework."]

## All printable ASCII characters (used to build the character table)
characters = string.printable
print(characters)
token_index = dict(zip(characters, range(1, len(characters)+1)))
print(token_index)
print(len(token_index))

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	

{'0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10, 'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'j': 20, 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'o': 25, 'p': 26, 'q': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'x': 34, 'y': 35, 'z': 36, 'A': 37, 'B': 38, 'C': 39, 'D': 40, 'E': 41, 'F': 42, 'G': 43, 'H': 44, 'I': 45, 'J': 46, 'K': 47, 'L': 48, 'M': 49, 'N': 50, 'O': 51, 'P': 52, 'Q': 53, 'R': 54, 'S': 55, 'T': 56, 'U': 57, 'V': 58, 'W': 59, 'X': 60, 'Y': 61, 'Z': 62, '!': 63, '"': 64, '#': 65, '$': 66, '%': 67, '&': 68, "'": 69, '(': 70, ')': 71, '*': 72, '+': 73, ',': 74, '-': 75, '.': 76, '/': 77, ':': 78, ';': 79, '<': 80, '=': 81, '>': 82, '?': 83, '@': 84, '[': 85, '\\': 86, ']': 87, '^': 88, '_': 89, '`': 90, '{': 91, '|': 92, '}': 93, '~': 94, ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}
100

max_length = 50 ## take the first 50 characters of each sentence
results = np.zeros(shape=(len(samples), max_length, len(token_index.values())+1))
for i, sentence in enumerate(samples):
    for j, character in enumerate(sentence[:max_length]):
        index = token_index[character]
        results[i, j, index] = 1. ## set position `index` of character j in sentence i to 1

print(results.shape)
print(results)

(2, 50, 101)
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]
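
Since the printed tensor is almost entirely zeros, it is hard to see that the encoding worked. One way to check, sketched below, is to decode it back: each non-empty row has a single 1, so `argmax` recovers the character index. The reverse lookup `index_to_char` is a helper introduced here for illustration.

```python
import string

import numpy as np

sample = "The cat sat on the mat."
characters = string.printable
token_index = {ch: i + 1 for i, ch in enumerate(characters)}  # index 0 stays unused
index_to_char = {i: ch for ch, i in token_index.items()}

max_length = 50
encoded = np.zeros((max_length, len(token_index) + 1))
for j, ch in enumerate(sample[:max_length]):
    encoded[j, token_index[ch]] = 1.0

# Rows that are all zeros are padding; every other row holds exactly one 1.
decoded = "".join(
    index_to_char[int(row.argmax())] for row in encoded if row.any()
)
print(decoded)
```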

1.2 Word-level `one-hot encoding` with `Keras`:

from keras.preprocessing.text import Tokenizer

samples = ["The cat sat on the mat.", "The dog ate my homework."]
tokenizer = Tokenizer(num_words=1000) ## only consider the 1,000 most common words
tokenizer.fit_on_texts(samples) ## build the word index
sequences = tokenizer.texts_to_sequences(samples) ## turn the strings into lists of integer indices
print(sequences)

[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

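one_hot_results = tokenizer.texts_to_matrix(samples, mode="binary") ## binary representation, one vector of length num_words per sample (the tokenizer also supports vectorization modes other than binary)
word_index = tokenizer.word_index ## retrieve the word index
print(word_index)
## As the output shows, Tokenizer strips special characters and ignores case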

{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
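
To make the behavior seen above concrete, here is a rough pure-Python approximation: lowercase, strip punctuation, then rank words by frequency. It is only a sketch that reproduces the output for these two samples and is not Keras's actual implementation; `simple_tokenize` and the use of `string.punctuation` as the filter set are assumptions.

```python
import string
from collections import Counter

samples = ["The cat sat on the mat.", "The dog ate my homework."]


def simple_tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


counts = Counter(word for text in samples for word in simple_tokenize(text))
# most_common() orders by frequency; words with equal counts keep first-seen order.
word_index = {
    word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)
}
sequences = [[word_index[w] for w in simple_tokenize(text)] for text in samples]

print(word_index)
print(sequences)
```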

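1.3 The `one-hot` hashing trick:

A variant of one-hot encoding is the one-hot hashing trick, which is useful when the number of unique tokens in the vocabulary is too large to handle explicitly.
The one-hot hashing trick uses a hash function to encode each word into a vector of fixed size, instead of keeping an explicit word-to-index dictionary.
Its advantages are that it saves memory and allows online encoding of the data (token vectors can be generated before all the data has been seen); its drawback is hash collisions, where two different words end up with the same hash value.

1.4 Word-level `one-hot` encoding with the hashing trick: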

samples = ["The cat sat on the mat.", "The dog ate my homework."]
dimensionality = 1000 ## store each word as a vector of size 1000 (with close to 1,000 distinct words or more, hash collisions become frequent and degrade the encoding)
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sentence in enumerate(samples):
    for j, word in list(enumerate(sentence.split()))[:max_length]:
        index = abs(hash(word)) % dimensionality  ## hash the word to a pseudo-random integer index between 0 and 999
        results[i, j, index] = 1.

print(results.shape)
print(results)

(2, 10, 1000)
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]
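
One caveat with the code above: Python's built-in `hash()` for strings is salted per process unless `PYTHONHASHSEED` is set, so the bucket assigned to a word can change between runs. If reproducible indices matter, a digest from `hashlib` is one option. The sketch below swaps in MD5 purely as a stable hash (an assumption made here, not the book's approach); `stable_index` is a hypothetical helper.

```python
import hashlib

import numpy as np


def stable_index(word, dimensionality):
    """Map a word to a bucket in [0, dimensionality), identical across runs."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality


samples = ["The cat sat on the mat.", "The dog ate my homework."]
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sentence in enumerate(samples):
    for j, word in enumerate(sentence.split()[:max_length]):
        results[i, j, stable_index(word, dimensionality)] = 1.0

print(results.shape)
```

Collisions are still possible with any hash; the only change is that the same word always lands in the same bucket.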

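2. Using word embeddings:

Differences between one-hot encoding and word embeddings:

One-hot vectors are binary, sparse (mostly zeros), and very high-dimensional (the dimensionality equals the number of words in the vocabulary);
Word embeddings are low-dimensional floating-point vectors (that is, dense vectors);
Unlike one-hot vectors, word embeddings are learned from data. Common embedding dimensionalities are 256, 512, or 1024, whereas one-hot word vectors are typically 20,000-dimensional or larger;
Compared with one-hot encoding, word embeddings are therefore denser and pack more information into far fewer dimensions.

There are two ways to obtain word embeddings:

Learn the embeddings jointly with the main task (as with the weights of a neural network, the word vectors are updated step by step during training);
Use pretrained word embeddings, i.e. embeddings precomputed on a task different from the one being solved, and load them into the model;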

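2.1 Learning word embeddings with the `Embedding` layer:

Word embeddings are meant to map human language into a geometric space. In a reasonable embedding space, synonyms should be embedded into similar word vectors.
More generally, the distance (such as the L2 distance) between any two word vectors should relate to the semantic distance between the words: words that mean different things are embedded far apart, while related words lie closer together.
Beyond distance, directions in the embedding space can also be meaningful.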
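
The distance idea can be illustrated with a few toy vectors. The 2-D values below are hand-picked for illustration and are not learned embeddings; `l2_distance` and `cosine_similarity` are small helpers defined here.

```python
import numpy as np

# Hypothetical 2-D embeddings, chosen by hand (not learned from data).
vectors = {
    "cat": np.array([1.0, 1.0]),
    "tiger": np.array([1.2, 2.0]),
    "car": np.array([-3.0, 0.5]),
}


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(l2_distance(vectors["cat"], vectors["tiger"]))
print(l2_distance(vectors["cat"], vectors["car"]))
print(cosine_similarity(vectors["cat"], vectors["tiger"]))
print(cosine_similarity(vectors["cat"], vectors["car"]))
```

In this toy space "cat" sits closer to "tiger" than to "car" under both measures, which is the property a useful embedding space is expected to have for related words.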

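No ideal word-embedding space that perfectly maps human language has been found so far.
What makes a good word embedding depends heavily on the task.

The Embedding layer can be understood as a dictionary that maps integer indices to dense vectors: it takes integers as input, looks them up in an internal table, and returns the associated vectors. (An Embedding layer is effectively a dictionary lookup.)

Word index --> Embedding layer --> corresponding word vector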
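
The lookup can be sketched in plain NumPy. Here `embedding_weights` is a random stand-in for the matrix an Embedding layer would learn, and the small sizes are arbitrary assumptions. The sketch also shows that the lookup gives the same result as one-hot encoding followed by a matrix multiplication, just without building the sparse vectors.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3
rng = np.random.default_rng(0)
# Stand-in for the weight matrix an Embedding layer would learn.
embedding_weights = rng.normal(size=(vocab_size, embedding_dim))

word_indices = np.array([[1, 4, 2], [3, 0, 0]])  # shape (samples, sequence_length)

# Dictionary-style lookup: pick one row of the matrix per integer index.
looked_up = embedding_weights[word_indices]  # shape (2, 3, 3)

# Equivalent but wasteful route: one-hot encode, then matrix-multiply.
one_hot = np.eye(vocab_size)[word_indices]  # shape (2, 3, 6)
via_matmul = one_hot @ embedding_weights

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```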

2.1.1 Loading the IMDB dataset for use with an `Embedding` layer:

from keras.datasets import imdb
from keras.utils import pad_sequences

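max_features = 10000 ## the 10,000 most common words
maxlen = 20 ## keep 20 words per review (by default pad_sequences truncates from the front, so the last 20 words are kept)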

(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=max_features
)

x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
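
To show what the padding step does to the data, here is a NumPy sketch that mimics the default behavior of `pad_sequences` as I understand it (`padding="pre"` and `truncating="pre"`). `pad_pre` is a hypothetical helper written for this note, not the library's code.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]  # drop items from the front when too long
        padded[row, -len(kept):] = kept  # right-align, padding on the left
    return padded


print(pad_pre([[1, 2, 3, 4, 5, 6], [7, 8]], maxlen=4))
```

The long sequence loses its first two items and the short one gains two leading zeros, so every row ends up with the same length.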

2.1.2 Using an `Embedding` layer and a classifier on the IMDB data:

from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()

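## Embedding layer
model.add(Embedding(10000, 8, input_length=maxlen)) ## 10000: input_dim (number of tokens); 8: output_dim (embedding dimensionality); input_length: length of the input sequences

## Flatten the embedding output so it can feed the Dense layer
model.add(Flatten())

## Dense layer (a linear classifier)
model.add(Dense(1, activation="sigmoid"))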
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2) ## validation_split: hold out 20% of the training data for validation

## Results are shown below: validation accuracy reaches about 75%

Epoch 1/10
625/625 [==============================] - 7s 10ms/step - loss: 0.6665 - acc: 0.6278 - val_loss: 0.6120 - val_acc: 0.7038
Epoch 2/10
625/625 [==============================] - 5s 8ms/step - loss: 0.5376 - acc: 0.7504 - val_loss: 0.5248 - val_acc: 0.7346
Epoch 3/10
625/625 [==============================] - 5s 8ms/step - loss: 0.4603 - acc: 0.7874 - val_loss: 0.4994 - val_acc: 0.7486
Epoch 4/10
625/625 [==============================] - 5s 8ms/step - loss: 0.4205 - acc: 0.8093 - val_loss: 0.4956 - val_acc: 0.7540
Epoch 5/10
625/625 [==============================] - 5s 8ms/step - loss: 0.3928 - acc: 0.8253 - val_loss: 0.4929 - val_acc: 0.7574
Epoch 6/10
625/625 [==============================] - 6s 9ms/step - loss: 0.3699 - acc: 0.8377 - val_loss: 0.4963 - val_acc: 0.7580
Epoch 7/10
625/625 [==============================] - 5s 8ms/step - loss: 0.3501 - acc: 0.8503 - val_loss: 0.5033 - val_acc: 0.7572
Epoch 8/10
625/625 [==============================] - 5s 8ms/step - loss: 0.3315 - acc: 0.8612 - val_loss: 0.5098 - val_acc: 0.7554
Epoch 9/10
625/625 [==============================] - 5s 8ms/step - loss: 0.3141 - acc: 0.8704 - val_loss: 0.5180 - val_acc: 0.7556
Epoch 10/10
625/625 [==============================] - 5s 8ms/step - loss: 0.2974 - acc: 0.8777 - val_loss: 0.5265 - val_acc: 0.7516

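2.2 Using pretrained word embeddings:

Vectors produced by a word-embedding algorithm (such as word2vec) or taken from a precomputed embedding database (such as GloVe) can be loaded into the Embedding layer to vectorize text.

word2vec: developed at Google in 2013;
GloVe: developed at Stanford University in 2014;

The book uses GloVe embeddings as its example.

2.2.1 Downloading and reading the raw IMDB dataset:

Download link: https://mng.bz/0tIo

## Read the positive and negative samples from the training set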
import os

labels, texts = [], []

train_pos_dir = "./aclImdb/train/pos/"
train_neg_dir = "./aclImdb/train/neg/"

for posf in os.listdir(train_pos_dir):
    if posf.endswith(".txt"):
        with open(train_pos_dir+posf) as pf:
            texts.append(pf.read())
        labels.append(1)

for negf in os.listdir(train_neg_dir):
    if negf.endswith(".txt"):
        with open(train_neg_dir+negf) as nf:
            texts.append(nf.read())
        labels.append(0)

2.2.2 Tokenizing the text and splitting it into training and validation sets:

from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
import numpy as np

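maxlen = 100 ## limit each review to 100 words
training_samples = 200 ## train on 200 samples
validation_samples = 10000 ## validate on 10,000 samples
max_words = 10000 ## vocabulary size: only the 10,000 most common words in the dataset are considered

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts=texts) ## build the word index (the num_words cap is applied when texts are converted)
sequence = tokenizer.texts_to_sequences(texts) ## convert the texts into sequences of token indices
word_index = tokenizer.word_index

data = pad_sequences(sequence, maxlen=maxlen) ## pad or truncate every sequence to the same length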

labels = np.array(labels)

## Shuffle the samples
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

## Split into training and validation sets
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples+validation_samples]
y_val = labels[training_samples: training_samples+validation_samples]

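2.2.3 Downloading and preprocessing the `GloVe` word embeddings:

Download page for the precomputed embeddings: https://nlp.stanford.edu/projects/glove/
Because the original link no longer worked for me, I downloaded the file from Kaggle instead: https://www.kaggle.com/datasets/anindya2906/glove6b

The book uses the 100-dimensional vectors from glove.6B.zip (the file glove.6B.100d.txt)

## Inspect the first lines of glove.6B.100d.txt
! head -n 5 ./archive/glove.6B.100d.txt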

the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
, -0.10767 0.11053 0.59812 -0.54361 0.67396 0.10663 0.038867 0.35481 0.06351 -0.094189 0.15786 -0.81665 0.14172 0.21939 0.58505 -0.52158 0.22783 -0.16642 -0.68228 0.3587 0.42568 0.19021 0.91963 0.57555 0.46185 0.42363 -0.095399 -0.42749 -0.16567 -0.056842 -0.29595 0.26037 -0.26606 -0.070404 -0.27662 0.15821 0.69825 0.43081 0.27952 -0.45437 -0.33801 -0.58184 0.22364 -0.5778 -0.26862 -0.20425 0.56394 -0.58524 -0.14365 -0.64218 0.0054697 -0.35248 0.16162 1.1796 -0.47674 -2.7553 -0.1321 -0.047729 1.0655 1.1034 -0.2208 0.18669 0.13177 0.15117 0.7131 -0.35215 0.91348 0.61783 0.70992 0.23955 -0.14571 -0.37859 -0.045959 -0.47368 0.2385 0.20536 -0.18996 0.32507 -1.1112 -0.36341 0.98679 -0.084776 -0.54008 0.11726 -1.0194 -0.24424 0.12771 0.013884 0.080374 -0.35414 0.34951 -0.7226 0.37549 0.4441 -0.99059 0.61214 -0.35111 -0.83155 0.45293 0.082577
. -0.33979 0.20941 0.46348 -0.64792 -0.38377 0.038034 0.17127 0.15978 0.46619 -0.019169 0.41479 -0.34349 0.26872 0.04464 0.42131 -0.41032 0.15459 0.022239 -0.64653 0.25256 0.043136 -0.19445 0.46516 0.45651 0.68588 0.091295 0.21875 -0.70351 0.16785 -0.35079 -0.12634 0.66384 -0.2582 0.036542 -0.13605 0.40253 0.14289 0.38132 -0.12283 -0.45886 -0.25282 -0.30432 -0.11215 -0.26182 -0.22482 -0.44554 0.2991 -0.85612 -0.14503 -0.49086 0.0082973 -0.17491 0.27524 1.4401 -0.21239 -2.8435 -0.27958 -0.45722 1.6386 0.78808 -0.55262 0.65 0.086426 0.39012 1.0632 -0.35379 0.48328 0.346 0.84174 0.098707 -0.24213 -0.27053 0.045287 -0.40147 0.11395 0.0062226 0.036673 0.018518 -1.0213 -0.20806 0.64072 -0.068763 -0.58635 0.33476 -1.1432 -0.1148 -0.25091 -0.45907 -0.096819 -0.17946 -0.063351 -0.67412 -0.068895 0.53604 -0.87773 0.31802 -0.39242 -0.23394 0.47298 -0.028803
of -0.1529 -0.24279 0.89837 0.16996 0.53516 0.48784 -0.58826 -0.17982 -1.3581 0.42541 0.15377 0.24215 0.13474 0.41193 0.67043 -0.56418 0.42985 -0.012183 -0.11677 0.31781 0.054177 -0.054273 0.35516 -0.30241 0.31434 -0.33846 0.71715 -0.26855 -0.15837 -0.47467 0.051581 -0.33252 0.15003 -0.1299 -0.54617 -0.37843 0.64261 0.82187 -0.080006 0.078479 -0.96976 -0.57741 0.56491 -0.39873 -0.057099 0.19743 0.065706 -0.48092 -0.20125 -0.40834 0.39456 -0.02642 -0.11838 1.012 -0.53171 -2.7474 -0.042981 -0.74849 1.7574 0.59085 0.04885 0.78267 0.38497 0.42097 0.67882 0.10337 0.6328 -0.026595 0.58647 -0.44332 0.33057 -0.12022 -0.55645 0.073611 0.20915 0.43395 -0.012761 0.089874 -1.7991 0.084808 0.77112 0.63105 -0.90685 0.60326 -1.7515 0.18596 -0.50687 -0.70203 0.66578 -0.81304 0.18712 -0.018488 -0.26757 0.727 -0.59363 -0.34839 -0.56094 -0.591 1.0039 0.20664
to -0.1897 0.050024 0.19084 -0.049184 -0.089737 0.21006 -0.54952 0.098377 -0.20135 0.34241 -0.092677 0.161 -0.13268 -0.2816 0.18737 -0.42959 0.96039 0.13972 -1.0781 0.40518 0.50539 -0.55064 0.4844 0.38044 -0.0029055 -0.34942 -0.099696 -0.78368 1.0363 -0.2314 -0.47121 0.57126 -0.21454 0.35958 -0.48319 1.0875 0.28524 0.12447 -0.039248 -0.076732 -0.76343 -0.32409 -0.5749 -1.0893 -0.41811 0.4512 0.12112 -0.51367 -0.13349 -1.1378 -0.28768 0.16774 0.55804 1.5387 0.018859 -2.9721 -0.24216 -0.92495 2.1992 0.28234 -0.3478 0.51621 -0.43387 0.36852 0.74573 0.072102 0.27931 0.92569 -0.050336 -0.85856 -0.1358 -0.92551 -0.33991 -1.0394 -0.067203 -0.21379 -0.4769 0.21377 -0.84008 0.052536 0.59298 0.29604 -0.67644 0.13916 -1.5504 -0.20765 0.7222 0.52056 -0.076221 -0.15194 -0.13134 0.058617 -0.31869 -0.61419 -0.62393 -0.41548 -0.038175 -0.39804 0.47647 -0.15983

## Parse the unzipped embedding file into an index that maps each word to its vector
embedding_index = {}
with open("./archive/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.array(values[1:], dtype="float32")
        embedding_index[word] = coefs ## {word: word vector}
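
The parsing logic can be tried without downloading the full file. The sketch below runs the same idea on a tiny in-memory stand-in; the three-dimensional values are made up, and `load_glove` is a hypothetical helper that additionally skips blank lines.

```python
import io

import numpy as np


def load_glove(lines):
    """Build {word: vector} from an iterable of 'word v1 v2 ...' lines."""
    index = {}
    for line in lines:
        values = line.rstrip().split(" ")
        if len(values) < 2:
            continue  # skip blank or malformed lines
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index


# Tiny made-up stand-in for glove.6B.100d.txt (3 dimensions instead of 100).
fake_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.5\n\n")
toy_index = load_glove(fake_file)
print(toy_index["cat"])
```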

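2.2.4 Building the GloVe embedding matrix:

Convert the parsed GloVe data into a matrix of shape (max_words, embedding_dim), in which row i holds the vector of the word whose index in word_index is i.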

embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector ## words not found in embedding_index keep an all-zero row
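
It can be worth checking how many vocabulary words actually receive a pretrained vector. The sketch below repeats the loop on toy data and records the words that are missing; the dictionaries and the word "zzzxq" are invented for illustration.

```python
import numpy as np

# Toy stand-ins for the tokenizer's word_index and the parsed GloVe index.
word_index = {"the": 1, "movie": 2, "zzzxq": 3, "rare": 4}
embedding_index = {
    "the": np.array([0.1, 0.2]),
    "movie": np.array([0.3, 0.4]),
    "rare": np.array([0.5, 0.6]),
}
max_words, embedding_dim = 4, 2  # only indices 0..3 fit in the matrix

embedding_matrix = np.zeros((max_words, embedding_dim))
missing = []
for word, i in word_index.items():
    if i >= max_words:
        continue  # outside the vocabulary cap, never looked up
    vector = embedding_index.get(word)
    if vector is None:
        missing.append(word)  # its row stays all zeros
    else:
        embedding_matrix[i] = vector

print(missing)
print(embedding_matrix)
```

Row 0 is also all zeros because no word has index 0, which matches the padding value used earlier.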

2.2.5 Defining the model:

Similar to the model used above, with one extra Dense layer.

from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_7 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_7 (Flatten)         (None, 10000)             0         
                                                                 
 dense_13 (Dense)            (None, 32)                320032    
                                                                 
 dense_14 (Dense)            (None, 1)                 33        
                                                                 
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
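
The parameter counts in the summary can be reproduced by hand, which is a quick check that the layer shapes are what was intended. The arithmetic below uses the same max_words, embedding_dim, and maxlen values as the code above.

```python
max_words, embedding_dim, maxlen = 10000, 100, 100

embedding_params = max_words * embedding_dim  # one vector per token index
flatten_units = maxlen * embedding_dim        # Flatten has no weights of its own
dense_32_params = flatten_units * 32 + 32     # weights plus biases
dense_1_params = 32 * 1 + 1                   # weights plus bias

total = embedding_params + dense_32_params + dense_1_params
print(embedding_params, dense_32_params, dense_1_params, total)
```

Roughly three quarters of the parameters sit in the Embedding layer, which is the part that the pretrained GloVe matrix will fill.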

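2.2.6 Loading the `GloVe` embedding matrix into the `Embedding` layer:

Because the embedding matrix is pretrained, the Embedding layer needs to be frozen so that its parameters stay unchanged during training. Otherwise the updates driven by the randomly initialized Dense layers could disrupt what the pretrained vectors already encode.

## Load the GloVe matrix and freeze the Embedding layer
model.layers[0].set_weights([embedding_matrix]) ## use the GloVe embedding matrix as this layer's weights
model.layers[0].trainable = False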

2.2.7 Training and evaluating the model:

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["acc"])

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save("pre_trained_glove_model.h5")

Epoch 1/10
7/7 [==============================] - 2s 293ms/step - loss: 1.6333 - acc: 0.4850 - val_loss: 0.7132 - val_acc: 0.5178
Epoch 2/10
7/7 [==============================] - 1s 220ms/step - loss: 0.4865 - acc: 0.7500 - val_loss: 1.3270 - val_acc: 0.4926
Epoch 3/10
7/7 [==============================] - 1s 224ms/step - loss: 0.4093 - acc: 0.8500 - val_loss: 0.7674 - val_acc: 0.5204
Epoch 4/10
7/7 [==============================] - 1s 217ms/step - loss: 0.2749 - acc: 0.9150 - val_loss: 0.7042 - val_acc: 0.5594
Epoch 5/10
7/7 [==============================] - 1s 221ms/step - loss: 0.1683 - acc: 0.9700 - val_loss: 1.0019 - val_acc: 0.5115
Epoch 6/10
7/7 [==============================] - 1s 221ms/step - loss: 0.1868 - acc: 0.9400 - val_loss: 0.8960 - val_acc: 0.5217
Epoch 7/10
7/7 [==============================] - 1s 220ms/step - loss: 0.2400 - acc: 0.8950 - val_loss: 0.7848 - val_acc: 0.5454
Epoch 8/10
7/7 [==============================] - 1s 244ms/step - loss: 0.0482 - acc: 1.0000 - val_loss: 0.7529 - val_acc: 0.5622
Epoch 9/10
7/7 [==============================] - 1s 233ms/step - loss: 0.0364 - acc: 1.0000 - val_loss: 0.8478 - val_acc: 0.5441
Epoch 10/10
7/7 [==============================] - 1s 220ms/step - loss: 0.0274 - acc: 1.0000 - val_loss: 0.7957 - val_acc: 0.5588

## Plot the loss
import matplotlib.pyplot as plt
history_dict = history.history

loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values)+1)

plt.plot(epochs, loss_values, "bo", label="Training loss") ## "bo" means blue dots
plt.plot(epochs, val_loss_values, "b", label="Validation loss") ## "b" means a solid blue line
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

plt.show()

[Figure: training and validation loss per epoch]

## Plot the accuracy
## Training and validation accuracy
acc_values = history_dict["acc"]
val_acc_values = history_dict["val_acc"]
epochs = range(1, len(acc_values)+1)

plt.plot(epochs, acc_values, "bo", label="Training accuracy") ## "bo" means blue dots
plt.plot(epochs, val_acc_values, "b", label="Validation accuracy") ## "b" means a solid blue line
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()

plt.show()

[Figure: training and validation accuracy per epoch]

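Because there are so few training samples, the model's performance depends heavily on which 200 samples are picked. Since they are drawn at random, the acc and loss values above will differ from run to run.

The plots show that the model quickly overfits (training metrics keep improving while validation metrics stall), and validation accuracy fluctuates considerably. The model should therefore not be expected to do well on the test set.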
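
The same conclusion can be read directly from the numbers in the training log. The sketch below copies the validation metrics from that log and finds the best epoch for each, plus the spread of validation accuracy; the variable names are introduced here for illustration.

```python
# Validation metrics copied from the training log above.
val_loss = [0.7132, 1.3270, 0.7674, 0.7042, 1.0019,
            0.8960, 0.7848, 0.7529, 0.8478, 0.7957]
val_acc = [0.5178, 0.4926, 0.5204, 0.5594, 0.5115,
           0.5217, 0.5454, 0.5622, 0.5441, 0.5588]

# Epochs are 1-based, list positions are 0-based.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
acc_spread = max(val_acc) - min(val_acc)

print(best_loss_epoch, best_acc_epoch, round(acc_spread, 4))
```

Validation loss is lowest at epoch 4 while training loss keeps falling through epoch 10, and validation accuracy never moves far from chance level, which is consistent with overfitting on 200 samples.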

2.2.8 Evaluating the model on the test set:

## Read the positive and negative samples from the test set
import os

labels, texts = [], []

test_pos_dir = "./aclImdb/test/pos/"
test_neg_dir = "./aclImdb/test/neg/"

for posf in os.listdir(test_pos_dir):
    if posf.endswith(".txt"):
        with open(test_pos_dir+posf) as pf:
            texts.append(pf.read())
        labels.append(1)

for negf in os.listdir(test_neg_dir):
    if negf.endswith(".txt"):
        with open(test_neg_dir+negf) as nf:
            texts.append(nf.read())
        labels.append(0)

## Tokenize the test samples
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.array(labels)

## Load the saved weights and compute the test metrics
model.load_weights("./pre_trained_glove_model.h5")
model.evaluate(x_test, y_test)
## returns [test loss, test accuracy]

782/782 [==============================] - 4s 5ms/step - loss: 0.7888 - acc: 0.5620

[0.7887667417526245, 0.562000036239624]

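In the run recorded above, accuracy is only about 56%, which shows how hard it is to work with a dataset that has very few training samples.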
