1. Binary Classification
1.1 Loading the IMDB Dataset
The IMDB dataset contains 50,000 highly polarized movie reviews, split into 25,000 training samples and 25,000 test samples; both the training and test sets are 50% positive and 50% negative reviews.
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
## num_words=10000: keep only the 10,000 most frequent words in the training data and discard rarer words;
## train_data and test_data are lists of reviews, where each review is a list of word indices;
## train_labels and test_labels are lists of 0s and 1s (negative/positive);
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 6s 0us/step
print(train_data[0]) ## each review is stored as a sequence of vocabulary indices, one integer per word;
print(test_labels[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
0
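As an aside, an index sequence can be decoded back into words using the vocabulary that ships with the dataset. A minimal sketch with Keras's imdb.get_word_index() (indices 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown, hence the offset of 3):
word_index = imdb.get_word_index() ## dict mapping words to integer indices
reverse_word_index = {value: key for key, value in word_index.items()} ## map integer indices back to words
decoded_review = " ".join(reverse_word_index.get(i - 3, "?") for i in train_data[0]) ## unmapped indices become "?"
print(decoded_review)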
1.2 Preprocessing the Data
Converting the integer sequences into tensors
## convert the integer sequences into tensors that can be fed to the network,
## using one-hot encoding (strictly speaking multi-hot: each review sets several positions to 1)
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    ## dimension=10000 because the vocabulary was capped at the 10,000 most frequent words;
    results = np.zeros((len(sequences), dimension)) ## initialize a len(sequences) x 10000 matrix of zeros;
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1. ## set the positions of the word indices in review i to 1 (NumPy fancy indexing);
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
## vectorize the labels as well
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")
## the integer sequences after conversion to tensors
print(x_train.shape)
print(x_train[0])
print(y_test[0])
(25000, 10000)
[0. 1. 1. ... 0. 0. 0.]
0.0
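To make the encoding concrete, here is a quick sanity check (an illustrative addition, not part of the original run): a "review" containing word indices 3 and 5, vectorized against a toy vocabulary of size 8, sets exactly those two positions to 1.
print(vectorize_sequences([[3, 5]], dimension=8)) ## -> [[0. 0. 0. 1. 0. 1. 0. 0.]]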
1.3 Building the Network
A simple stack of fully connected (Dense) layers with relu activations.
Stacking Dense layers raises two key architecture decisions:
- How many layers should the network have?
- How many hidden units should each layer have?
A hidden unit is one dimension of the layer's representation space; 16 hidden units means the input data is projected into a 16-dimensional representation space.
More hidden units (a higher-dimensional representation space) let the network learn more complex representations, but also make it computationally more expensive.
## define the model
from keras import models
from keras import layers
model = models.Sequential() ## build a linear stack of layers
model.add(layers.Dense(16, activation="relu", input_shape=(10000, )))
model.add(layers.Dense(16, activation="relu")) ## 16: number of hidden units in this layer (the dimensionality of its representation space); relu: activation that zeroes out negative values
model.add(layers.Dense(1, activation="sigmoid")) ## sigmoid: activation that squashes any value into the [0, 1] range
## compile the model
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"]) ## loss: binary crossentropy; optimizer: rmsprop; evaluation metric: accuracy
Metal device set to: Apple M1
systemMemory: 8.00 GB
maxCacheSize: 2.67 GB
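To tie the hidden-unit discussion back to concrete numbers, calling the standard model.summary() (shown here as an illustrative addition) reports each layer's parameter count:
model.summary() ## first Dense layer: 10000*16 + 16 = 160,016 parameters;
                ## second Dense layer: 16*16 + 16 = 272; output layer: 16*1 + 1 = 17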
1.4 Training the Model
## hold out a validation set (set aside 10,000 samples from the original training data for validation; train on the rest)
x_val = x_train[:10000] ## validation data
partial_x_train = x_train[10000:] ## training data
y_val = y_train[:10000] ## validation labels
partial_y_train = y_train[10000:] ## training labels
## train the model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20, ## number of epochs
                    batch_size=512, ## mini-batch size
                    validation_data=(x_val, y_val)) ## monitor loss and accuracy on the validation set
## model.fit() returns a History object; its history attribute is a dict containing all the metrics recorded during training
Epoch 1/20
2023-06-06 21:55:20.911277: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
30/30 [==============================] - 3s 47ms/step - loss: 0.5334 - accuracy: 0.7871 - val_loss: 0.4422 - val_accuracy: 0.8205
Epoch 2/20
30/30 [==============================] - 1s 19ms/step - loss: 0.3317 - accuracy: 0.8979 - val_loss: 0.3197 - val_accuracy: 0.8862
Epoch 3/20
30/30 [==============================] - 1s 17ms/step - loss: 0.2383 - accuracy: 0.9252 - val_loss: 0.2873 - val_accuracy: 0.8876
Epoch 4/20
30/30 [==============================] - 0s 17ms/step - loss: 0.1843 - accuracy: 0.9424 - val_loss: 0.2761 - val_accuracy: 0.8904
Epoch 5/20
30/30 [==============================] - 0s 16ms/step - loss: 0.1475 - accuracy: 0.9540 - val_loss: 0.2824 - val_accuracy: 0.8876
Epoch 6/20
30/30 [==============================] - 0s 16ms/step - loss: 0.1253 - accuracy: 0.9621 - val_loss: 0.2898 - val_accuracy: 0.8868
Epoch 7/20
30/30 [==============================] - 0s 17ms/step - loss: 0.1007 - accuracy: 0.9719 - val_loss: 0.3063 - val_accuracy: 0.8840
Epoch 8/20
30/30 [==============================] - 0s 17ms/step - loss: 0.0826 - accuracy: 0.9777 - val_loss: 0.3309 - val_accuracy: 0.8778
Epoch 9/20
30/30 [==============================] - 0s 16ms/step - loss: 0.0692 - accuracy: 0.9819 - val_loss: 0.3497 - val_accuracy: 0.8786
Epoch 10/20
30/30 [==============================] - 0s 16ms/step - loss: 0.0544 - accuracy: 0.9868 - val_loss: 0.3707 - val_accuracy: 0.8780
Epoch 11/20
30/30 [==============================] - 0s 16ms/step - loss: 0.0446 - accuracy: 0.9899 - val_loss: 0.4029 - val_accuracy: 0.8761
Epoch 12/20
30/30 [==============================] - 1s 17ms/step - loss: 0.0338 - accuracy: 0.9935 - val_loss: 0.4364 - val_accuracy: 0.8742
Epoch 13/20
30/30 [==============================] - 1s 18ms/step - loss: 0.0315 - accuracy: 0.9932 - val_loss: 0.4550 - val_accuracy: 0.8749
Epoch 14/20
30/30 [==============================] - 1s 19ms/step - loss: 0.0181 - accuracy: 0.9978 - val_loss: 0.4940 - val_accuracy: 0.8726
Epoch 15/20
30/30 [==============================] - 1s 18ms/step - loss: 0.0173 - accuracy: 0.9979 - val_loss: 0.5231 - val_accuracy: 0.8727
Epoch 16/20
30/30 [==============================] - 1s 18ms/step - loss: 0.0137 - accuracy: 0.9977 - val_loss: 0.5800 - val_accuracy: 0.8648
Epoch 17/20
30/30 [==============================] - 1s 18ms/step - loss: 0.0076 - accuracy: 0.9997 - val_loss: 0.6507 - val_accuracy: 0.8583
Epoch 18/20
30/30 [==============================] - 0s 16ms/step - loss: 0.0092 - accuracy: 0.9991 - val_loss: 0.6157 - val_accuracy: 0.8694
Epoch 19/20
30/30 [==============================] - 0s 16ms/step - loss: 0.0041 - accuracy: 0.9999 - val_loss: 0.6636 - val_accuracy: 0.8658
Epoch 20/20
30/30 [==============================] - 0s 16ms/step - loss: 0.0047 - accuracy: 0.9996 - val_loss: 0.6849 - val_accuracy: 0.8677
## history contains the metrics monitored during training and validation (loss and accuracy).
history_dict = history.history
history_dict.keys()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
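As a quick programmatic check of where validation performance peaks (an illustrative addition, not part of the original notebook):
best_epoch = np.argmin(history_dict["val_loss"]) + 1 ## epoch with the lowest validation loss (np.argmin is 0-based)
print(best_epoch) ## for this run, agrees with the plots below: epoch 4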
1.5 Visualizing the Monitored Metrics
## training loss and validation loss
import matplotlib.pyplot as plt
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values)+1)
plt.plot(epochs, loss_values, "bo", label="Training loss") ## "bo" means blue dots
plt.plot(epochs, val_loss_values, "b", label="Validation loss") ## "b" means a solid blue line
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
## training accuracy and validation accuracy
plt.clf() ## clear the previous figure before plotting accuracy
acc_values = history_dict["accuracy"]
val_acc_values = history_dict["val_accuracy"]
plt.plot(epochs, acc_values, "bo", label="Training accuracy") ## "bo" means blue dots
plt.plot(epochs, val_acc_values, "b", label="Validation accuracy") ## "b" means a solid blue line
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
The plots show that the training loss decreases and the training accuracy increases with every epoch, while the validation loss and validation accuracy reach their best values around the fourth epoch; training beyond that point makes the model overfit.
To avoid overfitting, we can stop training after the fourth epoch.
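Instead of picking the number of epochs by hand, Keras can stop training automatically with the EarlyStopping callback. A minimal sketch (illustrative, not part of the original notebook):
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True) ## stop once val_loss has not improved for 2 consecutive epochs, then roll back to the best weights
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])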
1.6 Retraining the Model from Scratch
## build the model
model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(10000, )))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
## compile the model
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
## train the model
model.fit(x_train, y_train, epochs=4, batch_size=512) ## train for only 4 epochs
results = model.evaluate(x_test, y_test) ## evaluate the model on the test set
print(results)
Epoch 1/4
49/49 [==============================] - 1s 15ms/step - loss: 0.4761 - accuracy: 0.8302
Epoch 2/4
49/49 [==============================] - 1s 12ms/step - loss: 0.2805 - accuracy: 0.9044
Epoch 3/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2104 - accuracy: 0.9257
Epoch 4/4
49/49 [==============================] - 1s 11ms/step - loss: 0.1748 - accuracy: 0.9380
782/782 [==============================] - 4s 5ms/step - loss: 0.2906 - accuracy: 0.8833
[0.29057276248931885, 0.8833200335502625]
1.7 Generating Predictions on New Data with the Trained Model
model.predict(x_test)
782/782 [==============================] - 3s 3ms/step
array([[0.2139433],
[0.9987571],
[0.7920793],
...,
[0.0869531],
[0.0685806],
[0.5763908]], dtype=float32)
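Each output is the predicted probability that the review is positive. To turn these probabilities into hard class labels, threshold at 0.5 (an illustrative addition):
predictions = model.predict(x_test)
predicted_classes = (predictions > 0.5).astype("int32") ## 1 = positive, 0 = negative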
1.8 Summary
- Raw data must be preprocessed: the word sequences are turned into tensors (here via one-hot/multi-hot encoding; word embeddings are an alternative);
- Every intermediate layer needs an activation function;
- For binary classification, the last layer should have a single unit activated with sigmoid, so the output is a scalar between 0 and 1 that can be read as a probability;
- With a sigmoid output on a binary classification problem, use binary crossentropy (binary_crossentropy) as the loss function; see the formula below;
- Training for more epochs can cause overfitting, so use the monitored metrics to pick the optimal number of epochs.
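For reference, the binary crossentropy loss for a single sample with true label $y \in \{0, 1\}$ and predicted probability $\hat{y}$ is

$$L(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

which is small when $\hat{y}$ is close to $y$ and grows without bound as the prediction moves toward the wrong class.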