Dive into Deep Learning V2 daily notes (model selection + overfitting and underfitting)

These notes mainly follow Mu Li's video tutorial: https://www.bilibili.com/video/BV1K64y1Q7wu/?spm_id_from=333.788.recommend_more_video.0&vd_source=c7bfc6ce0ea0cbe43aa288ba2713e56d
Text tutorial: https://zh-v2.d2l.ai/

These notes record the parts of Mu Li's code that I did not fully understand. They are not rigorous and are for reference only.

1. Function index

1.1 numpy

numpy	Section
zeros	4.1
array	4.1
np.random.normal	4.1
np.random.shuffle	4.1
np.power	4.1
np.arange	4.1

1.2 torch.Tensor

Tensor	Section
sum	4.2
numel	4.2

1.3 pandas

pandas	Section
DataFrame	4.2

1.4 matplotlib.pyplot

plt	Section
figure	4.3
subplot	4.3
plot	4.3
legend	4.3
xlabel or ylabel	4.3
show	4.3

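2. Model selection

2.1 Training error and generalization error

Training error: the model's error on the training data.
Generalization error: the model's error on new data.
What we actually care about is the generalization error.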

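2.2 Validation dataset and test dataset

Validation dataset: a dataset used to evaluate how good a model is.
1. For example, hold out 50% of the training data.
2. Do not mix it with the training data.
3. Validation results can still be overly optimistic, so they do not necessarily reflect how well the model generalizes to new data.
Test dataset: a dataset that is used only once. For example:
1. A future exam.
2. The actual sale price of a house I bid on.
3. The dataset behind the private leaderboard on Kaggle.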
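
As a rough illustration of the hold-out idea above, the sketch below splits a toy dataset into disjoint training and validation parts. The data, the seed, and the 50% ratio are assumptions chosen for the example; this is not code from the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Hypothetical toy data: 10 samples with 3 features each.
X = np.arange(30, dtype=float).reshape(10, 3)
y = np.arange(10, dtype=float)

# Shuffle the sample indices once, then cut them into two disjoint parts.
indices = rng.permutation(len(X))
n_val = len(X) // 2  # hold out 50% of the samples for validation
val_idx, train_idx = indices[:n_val], indices[n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(X_train.shape, X_val.shape)  # (5, 3) (5, 3)
```

Because the two index sets never overlap, no validation sample leaks into training.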

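2.3 K-fold cross-validation

Used when there is not enough data (which is the usual situation).
Algorithm:
1. Split the training data into K parts.
2. For i = 1, …, K:
use part i as the validation dataset and the remaining parts as the training dataset.
3. Report the average of the K validation errors.
Common choices: K = 5 or 10.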
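
The algorithm above can be sketched in a few lines of NumPy. The helper name k_fold_indices and the toy "model" (predicting the training mean) are hypothetical and only serve to show the loop structure. In practice the data would usually be shuffled before splitting.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical toy problem: predict y with the training mean, score by MSE.
rng = np.random.default_rng(0)
y = rng.normal(size=20)

val_errors = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    prediction = y[train_idx].mean()  # "train" on the other K-1 folds
    val_errors.append(np.mean((y[val_idx] - prediction) ** 2))  # validate on fold i

print(np.mean(val_errors))  # average of the K validation errors
```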

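3. Overfitting and underfitting

3.1 Model capacity

The ability to fit a variety of functions.
Low-capacity models struggle to fit the training data.
High-capacity models can memorize all of the training data.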

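3.2 Estimating model capacity

Capacity is hard to compare across different kinds of algorithms,
for example tree models and neural networks.
Within one model family, two main factors matter:
1. The number of parameters.
2. The range of values the parameters can take.

Number of parameters, where d is the number of input features, m the number of hidden units, and k the number of outputs:

Linear model	Single-hidden-layer perceptron
d+1	(d+1)m+(m+1)k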
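
To make the two formulas in the table concrete, here is a small sketch that evaluates them. It reads d as the number of inputs, m as the number of hidden units, and k as the number of outputs; the sizes 784, 256, and 10 are only illustrative assumptions.

```python
def linear_param_count(d):
    """d weights plus one bias, for a single output."""
    return d + 1

def mlp_param_count(d, m, k):
    """(d+1)*m parameters for input->hidden, (m+1)*k for hidden->output."""
    return (d + 1) * m + (m + 1) * k

# Hypothetical sizes: 784 inputs, 256 hidden units, 10 outputs.
print(linear_param_count(784))        # 785
print(mlp_param_count(784, 256, 10))  # 203530
```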


3.3 Data complexity

Several factors matter:
Number of samples.
Number of elements in each sample.
Temporal and spatial structure.
Diversity.

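4. Polynomial regression

4.1 Generating the dataset

4.1.1 np.zeros

Returns a new array of the given shape and type, filled with zeros.

numpy.zeros(shape, dtype=float, order='C', *, like=None)

Parameters
shape: int or tuple of ints.
dtype: data type, optional (defaults to float64).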

import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

max_degree = 20
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)
print(true_w)
# Output:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

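4.1.2 array

np.array() converts input data (such as a list or tuple) into a NumPy array.

numpy.array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0, like=None)

Parameters
object: the input data (for example a list or tuple).
dtype: desired data type, optional; inferred from the data if not given.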
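
A short sketch of typical np.array calls, added here only as an illustration; the values are arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3])                           # from a list
b = np.array((1.5, 2.5))                          # from a tuple
c = np.array([[1, 2], [3, 4]], dtype=np.float32)  # nested list with an explicit dtype

print(a.shape, b.dtype, c.shape, c.dtype)  # (3,) float64 (2, 2) float32
```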

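4.1.3 np.random.normal

np.random.normal() draws random samples from a normal distribution with the given mean and standard deviation. It can return a single number or an array of any shape.

# Defaults: mean 0, standard deviation 1
numpy.random.normal(loc=0.0, scale=1.0, size=None)

Parameters
loc: float or array-like, the mean of the distribution (default 0.0).
scale: float or array-like, the standard deviation of the distribution (default 1.0); must be non-negative.
size: int or tuple of ints, the output shape. An int returns that many samples; a tuple returns an array of that shape. With the default None and scalar loc and scale, a single value is returned.
Return value
One or more samples drawn from the specified normal distribution.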
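
The following sketch shows how loc, scale, and size interact. The seed and the specific numbers are assumptions made for the example.

```python
import numpy as np

np.random.seed(0)  # seeded only so that repeated runs give the same numbers

single = np.random.normal()                             # one float, mean 0, std 1
vector = np.random.normal(loc=10.0, scale=2.0, size=5)  # shape (5,)
matrix = np.random.normal(scale=0.1, size=(200, 1))     # shape (200, 1)

# With many samples, the empirical mean and std approach loc and scale.
big = np.random.normal(loc=3.0, scale=0.5, size=100_000)
print(vector.shape, matrix.shape)  # (5,) (200, 1)
print(big.mean(), big.std())       # close to 3.0 and 0.5
```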

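4.1.4 np.random.shuffle

np.random.shuffle() shuffles an array in place. For a multi-dimensional array it only shuffles along the first axis: the rows are reordered, but the contents of each row are unchanged.

numpy.random.shuffle(x)

Parameters
x: the array (or other mutable sequence) to shuffle.
Return value
np.random.shuffle() returns None, because it works in place and modifies x directly.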
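
A minimal sketch of the in-place behavior on a 2D array; the array contents and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)

x = np.arange(12).reshape(4, 3)  # rows: [0 1 2], [3 4 5], [6 7 8], [9 10 11]
result = np.random.shuffle(x)    # reorders the rows of x in place

print(result)  # None
print(x)       # the same four rows, possibly in a different order
```

A common mistake is writing x = np.random.shuffle(x), which replaces x with None.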

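4.1.5 np.power

np.power() raises each element of the first array to the power given by the corresponding element of the second array and returns the result as an array. The inputs can be scalars or arrays, and they are broadcast against each other.

numpy.power(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])

Parameters
x1: array or scalar, the bases.
x2: array or scalar, the exponents.
out (optional): array in which to store the result. If provided, its shape must be one that the inputs broadcast to.
where (optional): boolean array or condition selecting the positions at which to compute; defaults to True.
casting (optional): string that controls what kind of type casting is allowed.
order (optional): memory layout of the output array.
dtype (optional): data type of the output array.
subok (optional): bool, whether subclasses are passed through to the output.
Return value
An array in which each element is the corresponding element of x1 raised to the power of the corresponding element of x2.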

a = np.array([1, 2, 3, 4])
b = np.power(a, np.arange(len(a)))
print(b)
# [ 1  2  9 64]

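4.1.6 np.arange

np.arange returns an array of evenly spaced values. It is similar to Python's built-in range, but it returns a NumPy array rather than a range object.

numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)

Parameters
start (optional): start of the interval, included. Defaults to 0.
stop: end of the interval, not included.
step (optional): spacing between values. Defaults to 1.
dtype (optional): data type of the output array. If not given, it is inferred from the other arguments.
like (optional): reference object that allows creating arrays that are not NumPy arrays, for array types that support the __array_function__ protocol.
Return value
A NumPy array of evenly spaced values.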
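
The sketch below shows np.arange on its own and then combined with reshape and np.power, which is the broadcasting pattern used to build the polynomial features. The small 3-by-4 sizes are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

print(np.arange(5))         # [0 1 2 3 4]
print(np.arange(2, 10, 3))  # [2 5 8]
print(np.arange(4).reshape(1, -1).shape)  # (1, 4)

# Broadcasting a (3, 1) column against a (1, 4) row of exponents gives (3, 4).
x = np.array([[1.0], [2.0], [3.0]])
powers = np.power(x, np.arange(4).reshape(1, -1))
print(powers.shape)  # (3, 4)
print(powers)        # rows: [1, 1, 1, 1], [1, 2, 4, 8], [1, 3, 9, 27]
```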

max_degree = 20
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)
true_w[0:4] = np.array([5.0, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_test+n_train, 1))  # features has shape (200, 1)
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))  # poly_features has shape (200, 20)
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i+1)  # gamma(i+1) = i!, so column i becomes x^i / i!
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)
true_w, features, poly_features, labels = [torch.tensor(x, dtype=torch.float32) for x in [true_w, features, poly_features, labels]]
print(features[:2])
print(poly_features[:2,:])
print(labels[:2])
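
Reading the coefficients off true_w and the division by math.gamma(i+1), the labels generated by the code above follow this cubic polynomial with Gaussian noise of standard deviation 0.1 (all higher-order coefficients are zero):

```latex
y = 5 + 1.2\,x - 3.4\,\frac{x^2}{2!} + 5.6\,\frac{x^3}{3!} + \epsilon,
\qquad \epsilon \sim \mathcal{N}(0,\ 0.1^2)
```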

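4.2 Training and testing the model

4.2.1 Tensor.sum()

Tensor.sum() computes the sum of a tensor's elements in PyTorch. Its parameters let you sum over specific dimensions and choose whether the result keeps the reduced dimensions.

sum(dim=None, keepdim=False, dtype=None) → Tensor

Parameters
dim (int or tuple of ints, optional): the dimension(s) to sum over. If omitted, all elements are summed.
keepdim (bool, optional): if True, the reduced dimensions are kept with size 1. Defaults to False.
dtype (torch.dtype, optional): data type of the result. If given, the input is cast to this type before summing.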

x = torch.tensor([[1, 2, 3], [4, 5, 6]])

sum_dim0 = x.sum(dim=0)
print(sum_dim0)  # Output: tensor([5, 7, 9])

sum_dim1 = x.sum(dim=1)
print(sum_dim1)  # Output: tensor([ 6, 15])

sum_keepdim = x.sum(dim=0, keepdim=True)
print(sum_keepdim)  # Output: tensor([[5, 7, 9]])

4.2.2 Tensor.numel()

Returns the total number of elements in the tensor.

a = torch.randn(1, 2, 3, 4, 5)
a.numel()  # returns 120

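4.2.3 pandas.DataFrame()

pd.DataFrame() is the constructor for pandas DataFrame objects. A DataFrame is a two-dimensional labeled data structure. It can be viewed as a collection of Series that share the same index, similar to a database table or an Excel sheet.

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Parameters
data: the data, in one of several supported forms such as a dict, a list, or a NumPy array.
index: row labels. If not given, a RangeIndex starting at 0 is used.
columns: column labels. If not given, they are inferred from the data where possible; otherwise a RangeIndex is used.
dtype: data type to force for the DataFrame. If not given, it is inferred from the data.
copy: bool or None, default None. Whether to copy the input data. With None, dict input is copied, while DataFrame or 2D ndarray input is not.
1. Creating a DataFrame from a dict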

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

2. Creating a DataFrame from a list

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['name', 'age', 'city'])
print(df)

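3. Creating a DataFrame from a NumPy array

import numpy as np

# A NumPy array has a single dtype, so the mixed values below are all stored as strings
data = np.array([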
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['name', 'age', 'city'])
print(df)

Computing the loss

def evaluate_loss(net, data_iter, loss):
    metric = d2l.Accumulator(2)  # (sum of losses, number of loss elements)
    for X,y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        # l.sum(): total loss of the current batch
        # l.numel(): number of loss elements (samples) in the current batch
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]
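
To see why evaluate_loss accumulates l.sum() and l.numel() instead of averaging per batch, the sketch below reproduces the bookkeeping with two plain Python variables in place of d2l.Accumulator. The predictions and targets are made-up numbers, and the two batches deliberately have different sizes.

```python
import torch
from torch import nn

loss = nn.MSELoss(reduction='none')  # keep one loss value per element

# Hypothetical (prediction, target) pairs, split into two batches of different size.
batches = [
    (torch.tensor([[1.0], [2.0], [3.0]]), torch.tensor([[1.5], [2.0], [2.0]])),
    (torch.tensor([[4.0]]), torch.tensor([[6.0]])),
]

total_loss, total_count = 0.0, 0
for y_hat, y in batches:
    l = loss(y_hat, y)            # same shape as y_hat
    total_loss += float(l.sum())  # sum of squared errors in this batch
    total_count += l.numel()      # number of loss elements in this batch

print(total_loss / total_count)  # 1.3125
```

Averaging the two per-batch means instead would weight the single-sample batch as heavily as the three-sample batch and give a different number.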

One training epoch

def train_epoch_ch3(net, train_iter, loss, trainer):#@save
    if isinstance(net, nn.Module):
        net.train()
    metric = d2l.Accumulator(3)  # (sum of losses, number of correct predictions, number of samples)
    for X, y in train_iter:
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(trainer,torch.optim.Optimizer):
            # Reset the gradients to zero
            trainer.zero_grad()
            # Backpropagate to compute the gradients
            l.mean().backward()
            # Update the parameters using the gradients, which reduces the loss
            trainer.step()
        else:
            # Custom optimizer: backpropagate the summed loss and pass the batch size to the update
            l.sum().backward()
            trainer(X.shape[0])
        metric.add(float(l.sum()), d2l.accuracy(y_hat,y), y.numel())
    return metric[0]/metric[2], metric[1]/metric[2]
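
The zero_grad / backward / step sequence in the optimizer branch can be examined in isolation. The sketch below performs exactly one SGD step on a tiny linear model. The data, the zero initialization, and the learning rate of 0.01 are assumptions chosen so the result can be checked by hand.

```python
import torch
from torch import nn

net = nn.Linear(2, 1, bias=False)
with torch.no_grad():
    net.weight.zero_()  # start from w = [0, 0] so the numbers are reproducible

loss = nn.MSELoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=0.01)

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[1.0], [2.0]])

before = float(loss(net(X), y).mean())  # (1^2 + 2^2) / 2 = 2.5

trainer.zero_grad()  # clear gradients left over from earlier steps
l = loss(net(X), y)
l.mean().backward()  # gradient of the mean loss w.r.t. the weights: [-7, -10]
trainer.step()       # w <- w - lr * grad = [0.07, 0.10]

after = float(loss(net(X), y).mean())
print(before, after)  # 2.5 and roughly 1.2325
```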

Multi-epoch training

def train(train_features, test_features, train_labels, test_labels, num_epochs=400):
    loss = nn.MSELoss(reduction='none')
    input_shape = train_features.shape[-1]
    # No bias term, because the constant feature x^0 = 1 in the polynomial features already plays that role
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    # List of training losses
    train_loss_all = []
    # List of validation losses
    val_loss_all = []
    for epoch in range(num_epochs):
        train_loss, _ = train_epoch_ch3(net, train_iter, loss, trainer)
        test_loss = evaluate_loss(net, test_iter, loss)
        train_loss_all.append(train_loss)
        val_loss_all.append(test_loss)
        if epoch==0 or ((epoch+1)%20)==0:
            print(f"Epoch {epoch + 1}: training loss {train_loss}")
            print(f"Epoch {epoch + 1}: validation loss {test_loss}")

    train_process = pd.DataFrame(data={"epoch": range(num_epochs),
                                       "train_loss_all": train_loss_all,
                                       "val_loss_all": val_loss_all,
                                        })
    return train_process
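
To connect this experiment back to underfitting and overfitting without needing d2l, the sketch below regenerates the same kind of data and fits models with 2, 4, and 20 polynomial features. Note that it swaps the SGD training loop for a closed-form least-squares fit (np.linalg.lstsq), so it is a stand-in for the notes' train function rather than a copy of it; the helper name fit_and_score and the seed are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
max_degree, n_train, n_test = 20, 100, 100
true_w = np.zeros(max_degree)
true_w[:4] = [5.0, 1.2, -3.4, 5.6]

x = rng.normal(size=(n_train + n_test, 1))
poly = np.power(x, np.arange(max_degree).reshape(1, -1))
poly /= np.array([math.gamma(i + 1) for i in range(max_degree)])  # divide column i by i!
labels = poly @ true_w + rng.normal(scale=0.1, size=n_train + n_test)

def fit_and_score(n_features):
    """Least-squares fit on the first n_features columns; return (train MSE, test MSE)."""
    A_train, A_test = poly[:n_train, :n_features], poly[n_train:, :n_features]
    w, *_ = np.linalg.lstsq(A_train, labels[:n_train], rcond=None)
    train_mse = np.mean((A_train @ w - labels[:n_train]) ** 2)
    test_mse = np.mean((A_test @ w - labels[n_train:]) ** 2)
    return train_mse, test_mse

for n_features in (2, 4, max_degree):
    train_mse, test_mse = fit_and_score(n_features)
    print(n_features, train_mse, test_mse)
```

With 2 features the model is too simple for cubic data, so both errors should stay high (underfitting). With 4 features the model matches the data-generating polynomial and both errors should be near the noise level of about 0.01. With all 20 features the training error can only go down further, while the test error is no longer guaranteed to follow it.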

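4.3 Plotting

4.3.1 plt.figure

plt.figure() creates a new figure (or activates an existing one) in matplotlib. Its parameters control properties such as size, resolution, and background color.

matplotlib.pyplot.figure(num=None, figsize=None, dpi=None, *, facecolor=None, edgecolor=None, frameon=True, 
FigureClass=<class 'matplotlib.figure.Figure'>, clear=False, **kwargs)

Parameters
num: int or str, default None
An int sets the figure's ID. If a figure with that ID already exists, it is made active instead of creating a new one. A str is used as the figure's label and as the window title.
figsize: tuple, default None
Figure size as (width, height) in inches.
dpi: float, default None
Figure resolution in dots per inch. None means the value of rcParams["figure.dpi"], which is 100 by default.
facecolor: color, default None
Figure background color. None means rcParams["figure.facecolor"], which is 'white' by default.
edgecolor: color, default None
Figure border color. None means rcParams["figure.edgecolor"], which is 'white' by default.
frameon: bool, default True
Whether to draw the figure frame (its background patch).
tight_layout: bool or dict, default None
Passed through **kwargs to the Figure. If True or a dict, subplot parameters are adjusted automatically to fit the figure area.
constrained_layout: bool, default None
Passed through **kwargs to the Figure. If True, subplot positions are adjusted automatically so that elements fit the figure and do not overlap.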

import matplotlib.pyplot as plt

fig = plt.figure(facecolor='lightgray', edgecolor='blue')  # light gray background, blue border color
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()


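4.3.2 plt.subplot

plt.subplot() adds a subplot to the current figure in Matplotlib. It lets you arrange several plots on a grid within one figure, which makes it easy to compare different data or views side by side.

plt.subplot(nrows, ncols, index)

nrows: number of rows in the subplot grid.
ncols: number of columns in the subplot grid.
index: position of the subplot, counted from 1, left to right and top to bottom.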

plt.subplot(2, 2, 1)  # 2x2 grid, select subplot 1
plt.plot([1, 2, 3], [4, 5, 6])
plt.title('Subplot 1')

plt.subplot(2, 2, 2)  # 2x2 grid, select subplot 2
plt.plot([1, 2, 3], [6, 5, 4])
plt.title('Subplot 2')

plt.subplot(2, 2, 3)  # 2x2 grid, select subplot 3
plt.plot([1, 2, 3], [5, 6, 7])
plt.title('Subplot 3')

plt.subplot(2, 2, 4)  # 2x2 grid, select subplot 4
plt.plot([1, 2, 3], [7, 6, 5])
plt.title('Subplot 4')

plt.tight_layout()  # adjust subplot spacing automatically so titles and labels do not overlap
plt.show()


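4.3.3 plt.plot()

plt.plot() is Matplotlib's basic function for drawing 2D line plots. Depending on the format string, it draws lines, markers, or both.

import matplotlib.pyplot as plt

plt.plot(x, y, fmt, **kwargs)

x: x-axis data, a list, array, or similar sequence.
y: y-axis data, a list, array, or similar sequence.
fmt: optional format string for the line.
kwargs: optional additional plotting parameters.
1. Using a format string
A format string can contain shorthand for color, line style, and marker.

Color	Line style	Marker
b (blue)	- (solid)	. (point)
g (green)	-- (dashed)	, (pixel)
r (red)	-. (dash-dot)	v (triangle down)
c (cyan)	: (dotted)	^ (triangle up)
m (magenta)		s (square)
y (yellow)		* (star)
k (black)		1 (tri down)
w (white)		D (diamond)

Markers: '.' (point), ',' (pixel), 'o' (circle), 'v' (triangle down), '^' (triangle up), '<' (triangle left), '>' (triangle right), '1' (tri down), '2' (tri up), '3' (tri left), '4' (tri right), 's' (square), 'p' (pentagon), '*' (star), 'h' (hexagon 1), 'H' (hexagon 2), '+' (plus), 'x' (x), 'D' (diamond), 'd' (thin diamond), '|' (vertical line), '_' (horizontal line)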

2. Using keyword arguments

plt.plot(x, y, color='green', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=12)
plt.show()

Common keyword arguments:

color or c: line color.
linestyle or ls: line style.
linewidth or lw: line width.
marker: marker type.
markersize or ms: marker size.
markerfacecolor or mfc: marker fill color.
markeredgecolor or mec: marker edge color.

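plt.subplot(2, 2, 1)  # 2x2 grid, select subplot 1
plt.plot([1, 2, 3], [4, 5, 6],'gs--')  # green, square markers, dashed line
plt.title('Subplot 1')

plt.subplot(2, 2, 2)  # 2x2 grid, select subplot 2
plt.plot([1, 2, 3], [6, 5, 4],'ro-.')  # red, circle markers, dash-dot line
plt.title('Subplot 2')

plt.subplot(2, 2, 3)  # 2x2 grid, select subplot 3
plt.plot([1, 2, 3], [5, 6, 7],'ch-')  # cyan, hexagon markers, solid line
plt.title('Subplot 3')

plt.subplot(2, 2, 4)  # 2x2 grid, select subplot 4
plt.plot([1, 2, 3], [7, 6, 5],'yp:')  # yellow, pentagon markers, dotted line
plt.title('Subplot 4')

plt.tight_layout()  # adjust subplot spacing automatically so titles and labels do not overlap
plt.show()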


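4.3.4 plt.legend()

plt.legend() adds a legend to a Matplotlib plot. A legend is a box on the figure that explains what each data series (lines, points, bars, and so on) represents, which makes the plot easier to read.
1. Automatic detection of the legend entries
When called with no arguments, it automatically adds every plotted element that has a label to the legend.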

import matplotlib.pyplot as plt

# Plot several lines
plt.plot([1, 2, 3], [4, 5, 6], label='Line 1')
plt.plot([1, 2, 3], [6, 5, 4], label='Line 2')

# Add the legend
plt.legend()
plt.show()


4.3.5 plt.xlabel() or plt.ylabel()

plt.xlabel() sets the x-axis label.
plt.ylabel() sets the y-axis label.

4.3.6 plt.show()

Displays all open figures.

def matplot_acc_loss(train_process):
    # Plot the training and validation loss after each epoch
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 1, 1)
    plt.plot(train_process['epoch'], train_process.train_loss_all, "ro-", label="Train loss")
    plt.plot(train_process['epoch'], train_process.val_loss_all, "bs-", label="Val loss")
    plt.legend()
    plt.xlabel("epoch")
    plt.ylabel("Loss")
    plt.show()

5. Complete code

import math
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
from torch import nn
from d2l import torch as d2l

max_degree = 20
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)
true_w[0:4] = np.array([5.0, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_test+n_train, 1))  # features has shape (200, 1)
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))  # poly_features has shape (200, 20)
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i+1)  # gamma(i+1) = i!, so column i becomes x^i / i!
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)
true_w, features, poly_features, labels = [torch.tensor(x, dtype=torch.float32) for x in [true_w, features, poly_features, labels]]
print(features[:2])
print(poly_features[:2,:])
print(labels[:2])
# Train and test the model
def evaluate_loss(net, data_iter, loss):
    metric = d2l.Accumulator(2)  # (sum of losses, number of loss elements)
    for X,y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        # l.sum(): total loss of the current batch
        # l.numel(): number of loss elements (samples) in the current batch
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]

def train_epoch_ch3(net, train_iter, loss, trainer):#@save
    if isinstance(net, nn.Module):
        net.train()
    metric = d2l.Accumulator(3)
    for X, y in train_iter:
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(trainer,torch.optim.Optimizer):
            # Reset the gradients to zero
            trainer.zero_grad()
            # Backpropagate to compute the gradients
            l.mean().backward()
            # Update the parameters using the gradients, which reduces the loss
            trainer.step()
        else:
            l.sum().backward()
            trainer(X.shape[0])
        metric.add(float(l.sum()), d2l.accuracy(y_hat,y), y.numel())
    return metric[0]/metric[2], metric[1]/metric[2]

def train(train_features, test_features, train_labels, test_labels, num_epochs=400):
    loss = nn.MSELoss(reduction='none')
    input_shape = train_features.shape[-1]
    # No bias term, because the constant feature x^0 = 1 in the polynomial features already plays that role
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    # List of training losses
    train_loss_all = []
    # List of validation losses
    val_loss_all = []
    for epoch in range(num_epochs):
        train_loss, _ = train_epoch_ch3(net, train_iter, loss, trainer)
        test_loss = evaluate_loss(net, test_iter, loss)
        train_loss_all.append(train_loss)
        val_loss_all.append(test_loss)
        if epoch==0 or ((epoch+1)%20)==0:
            print(f"Epoch {epoch + 1}: training loss {train_loss}")
            print(f"Epoch {epoch + 1}: validation loss {test_loss}")

    train_process = pd.DataFrame(data={"epoch": range(num_epochs),
                                       "train_loss_all": train_loss_all,
                                       "val_loss_all": val_loss_all,
                                        })
    return train_process

def matplot_acc_loss(train_process):
    # Plot the training and validation loss after each epoch
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 1, 1)
    plt.plot(train_process['epoch'], train_process.train_loss_all, "ro-", label="Train loss")
    plt.plot(train_process['epoch'], train_process.val_loss_all, "bs-", label="Val loss")
    plt.legend()
    plt.xlabel("epoch")
    plt.ylabel("Loss")
    plt.show()

# Use the first 4 polynomial features, i.e. 1, x, x^2/2!, x^3/3!
train_process = train(poly_features[:n_train, :4], poly_features[n_train:, :4], labels[:n_train], labels[n_train:])
matplot_acc_loss(train_process)