Weekly Study Report: Week 34

Abstract

This week's paper proposes AT-LSTM, a model that combines an attention mechanism with an LSTM. A conventional LSTM captures long-term dependencies in a time series through its gating mechanism, but on multivariate series it does not sufficiently emphasize the features that matter most for the prediction. AT-LSTM applies attention after LSTM feature extraction to increase the weight of features at key time steps, which improves prediction accuracy. The adaptive weighting also suppresses irrelevant features, so the model concentrates on the variables that drive dissolved oxygen (DO). In addition, the authors use a sliding window to convert the time series into supervised samples and tune the hyperparameters with Bayesian optimization to improve model performance.

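1 Literature Reading

This week I read the paper "Water Quality Prediction Based on LSTM and Attention Mechanism: A Case Study of the Burnett River, Australia".
Paper: Water Quality Prediction Based on LSTM and Attention Mechanism

The paper proposes AT-LSTM, a hybrid model based on a long short-term memory network and an attention mechanism, to predict the dissolved oxygen (DO) content of the Burnett River.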

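1.1 Background

1.1.1 Bayesian Optimization

Bayesian optimization is a global optimization method for black-box functions that are expensive to evaluate, whether computationally or experimentally. In machine learning it is commonly used to tune hyperparameters such as the learning rate and regularization coefficients. It is particularly suitable when the objective has no analytical form, derivatives are unavailable, or a single evaluation is costly.
The core idea is to first approximate the objective with a surrogate model, typically a Gaussian process, which also estimates its own uncertainty. An acquisition function then balances exploration and exploitation to decide where to sample next. Finally, the surrogate is updated with the new sample, and the process gradually approaches the global optimum.
(1) Gaussian process
A Gaussian process models a distribution over functions.

(2) Acquisition function
The acquisition function determines the next sampling point; Expected Improvement (EI) is a common choice.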
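
For reference, one common textbook form of these two ingredients is sketched below. The notation is an assumption of this note rather than something taken from the paper: m and k are the prior mean and kernel, \mu(x) and \sigma(x) are the posterior mean and standard deviation of the surrogate, f^{+} is the best value observed so far, \xi \ge 0 is a small exploration margin, and \Phi and \phi are the standard normal CDF and PDF. This is the maximization form of EI.

```latex
f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)

\mathrm{EI}(x) = \big(\mu(x) - f^{+} - \xi\big)\,\Phi(z) + \sigma(x)\,\phi(z),
\qquad z = \frac{\mu(x) - f^{+} - \xi}{\sigma(x)}, \quad \sigma(x) > 0
```

When \sigma(x) = 0 the point is already known exactly and EI is taken to be 0. The first term rewards points whose predicted mean exceeds the current best (exploitation), and the second rewards points with high uncertainty (exploration).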

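Sampling is repeated until the maximum number of iterations is reached or EI converges. As a worked example, consider maximizing f(x) = -x^2 + 4x on the interval [0, 4].

The code for this example is as follows: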

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from scipy.optimize import minimize
from scipy.stats import norm

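# Objective function: f(x) = -x^2 + 4x
def objective_function(x):
    return -x[0] * x[0] + 4 * x[0]  # x[0] because the input is passed as an array

# Acquisition function: Expected Improvement (EI) with exploration margin xi
def acquisition_function(x, gp, y_max, xi=0.01):
    mean, std = gp.predict(np.atleast_2d(x), return_std=True)
    mean, std = mean[0], max(std[0], 1e-9)  # scalar values; guard against std == 0
    z = (mean - y_max - xi) / std
    return (mean - y_max - xi) * norm.cdf(z) + std * norm.pdf(z)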

# Plot the objective function curve
x = np.linspace(0, 4, 100)  # domain [0, 4]
y = -x * x + 4 * x  # evaluate f(x)

plt.figure(figsize=(10, 7))
plt.plot(x, y, 'b-', label='Objective Function: f(x) = -x^2 + 4x')
plt.title('Objective Function')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)

# Initial random samples (one-dimensional)
initial_points = np.random.uniform(0, 4, (5, 1))  # 5 points, each of dimension 1
initial_values = np.array([objective_function(x) for x in initial_points])

# Gaussian process surrogate model
kernel = Matern(nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(initial_points, initial_values)

# Bayesian optimization loop
n_iter = 15
for i in range(n_iter):
    y_max = max(initial_values)
    # Maximize the acquisition function (minimize its negative) over the 1-D domain
    res = minimize(lambda x: -acquisition_function(x, gp, y_max),
                   x0=np.random.uniform(0, 4, 1),
                   bounds=[(0, 4)])
    next_sample = res.x
    next_value = objective_function(next_sample)

    initial_points = np.vstack((initial_points, next_sample))
    initial_values = np.append(initial_values, next_value)

    gp.fit(initial_points, initial_values)

    # Label only the first red point so the legend has a single entry for BO samples
    plt.scatter(next_sample[0], next_value, c='red',
                label='BO samples' if i == 0 else None)

# Plot only the 5 initial random points in black so the red BO samples stay visible
plt.scatter(initial_points[:5, 0], initial_values[:5], c='black', label='Initial samples')
plt.legend()
plt.show()

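# Report the best result
best_idx = np.argmax(initial_values)
print(f"Best x = {initial_points[best_idx, 0]:.4f}")
print(f"Maximum f(x) = {initial_values[best_idx]:.4f}")

The resulting plot is shown below:
[Figure: objective function curve with the sampled points]
The console output is:

Best x = 2.0238
Maximum f(x) = 3.9994

This is very close to the true optimum x = 2, where f(2) = 4.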

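1.1.2 Attention Mechanism Review

This review covers multi-head attention, which splits self-attention into several heads. Each head computes self-attention in a different subspace, so the model can capture information from different perspectives. The outputs of all heads are then concatenated and passed through a linear transformation to obtain the final representation, which improves the model's expressive power. In theory, each head in Multi-Head Attention has its own W^q, W^k, and W^v. The figure below shows an example with eight heads; the steps are:
1) Take the input sequence.
2) Map it to vectors with an embedding.
3) Split into 8 heads and multiply X (or R) by each head's weight matrices.
4) Compute attention using the resulting Q/K/V matrices.
5) Concatenate the resulting Z matrices and multiply by the weight matrix W to obtain the layer output.
[Figure: eight-head attention example]
The code implementation is as follows: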

import math
import torch
from torch import nn
from d2l import torch as d2l
def transpose_qkv(X,num_heads):
    # Input X shape: (batch_size, number of queries or key-value pairs, num_hiddens)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    # After reshape: (batch_size, number of queries or key-value pairs, num_heads, num_hiddens/num_heads)
    X = X.permute(0, 2, 1, 3)
    # After permute: (batch_size, num_heads, number of queries or key-value pairs, num_hiddens/num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])
 
def transpose_output(X,num_heads):
    # Reverse the operation of `transpose_qkv`
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)
 
# Multi-head attention model
class MultiHeadAttention(nn.Module):
    def __init__(self,key_size,query_size,value_size,num_hiddens,
                num_heads,dropout,bias=False,**kwargs):
        super(MultiHeadAttention,self).__init__(**kwargs)
        self.num_heads = num_heads
 
        # Use scaled dot-product attention
        self.attention = d2l.DotProductAttention(dropout)
 
        # Project the inputs to shape (batch_size, number of queries or key-value pairs, num_hiddens)
        self.W_q = nn.Linear(query_size,num_hiddens,bias=bias)
        self.W_k = nn.Linear(key_size,num_hiddens,bias=bias)
        self.W_v = nn.Linear(value_size,num_hiddens,bias=bias)
        self.W_o = nn.Linear(num_hiddens,num_hiddens,bias=bias)
    
    def forward(self,queries,keys,values,valid_lens):
        # Shape of `queries`, `keys`, or `values`:
        # (`batch_size`, number of queries or key-value pairs, `num_hiddens`)
        # Shape of `valid_lens`:
        # (`batch_size`,) or (`batch_size`, number of queries)
        # After the transformation, `queries`, `keys`, and `values` have shape:
        # (`batch_size` * `num_heads`, number of queries or key-value pairs, `num_hiddens` / `num_heads`)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)

        # All heads are stacked along the batch axis and computed in one pass,
        # so `valid_lens` is repeated once per head
        if valid_lens is not None:
            valid_lens = torch.repeat_interleave(valid_lens,
                                                repeats=self.num_heads,
                                                dim=0)
        output = self.attention(queries,keys,values,valid_lens)
        # output -> (10, 4, 20) in the test example
        # (the original returned `output` here, which skipped the concatenation and W_o)
        output_concat = transpose_output(output,self.num_heads)
        # output_concat -> (2, 4, 100)
        return self.W_o(output_concat)
 
# Test the MultiHeadAttention class with a small example where keys and values are the same
# The multi-head attention output has shape (batch_size, num_queries, num_hiddens)
# The linear projections have 100 outputs, split across 5 heads
num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens,num_hiddens, num_heads, 0.5)
attention.eval()
# Output:
# MultiHeadAttention(
#   (attention): DotProductAttention(
#     (dropout): Dropout(p=0.5, inplace=False)
#   )
#   (W_q): Linear(in_features=100, out_features=100, bias=False)
#   (W_k): Linear(in_features=100, out_features=100, bias=False)
#   (W_v): Linear(in_features=100, out_features=100, bias=False)
#   (W_o): Linear(in_features=100, out_features=100, bias=False)
# )
# Batch size 2, 4 queries, 6 key-value pairs
batch_size, num_queries, num_kvpairs, valid_lens = 2, 4, 6, torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))  # queries: (2, 4, 100)
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))  # keys and values: (2, 6, 100)
output = attention(X, Y, Y, valid_lens)  # same shape as the input queries
output.shape
# Output: torch.Size([2, 4, 100])
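
As a dependency-light cross-check of the shapes above, the same computation can be sketched in plain NumPy. This is a minimal sketch, not the d2l implementation: it omits dropout and `valid_lens` masking, and the names `split_heads` and `multi_head_attention` as well as the random weights are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    """(batch, seq, hidden) -> (batch, heads, seq, hidden // heads)."""
    b, s, h = x.shape
    return x.reshape(b, s, num_heads, h // num_heads).transpose(0, 2, 1, 3)

def multi_head_attention(q, k, v, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head scaled dot-product attention (no masking, no dropout)."""
    Q = split_heads(q @ Wq, num_heads)
    K = split_heads(k @ Wk, num_heads)
    V = split_heads(v @ Wv, num_heads)
    d_head = Q.shape[-1]
    # Attention scores per head: (batch, heads, n_queries, n_kv)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (batch, heads, n_queries, d_head)
    b, _, n_q, _ = heads.shape
    # Concatenate the heads back to (batch, n_queries, hidden), then project
    concat = heads.transpose(0, 2, 1, 3).reshape(b, n_q, num_heads * d_head)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
num_hiddens, num_heads = 100, 5
Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(num_hiddens, num_hiddens))
                  for _ in range(4))
X = rng.normal(size=(2, 4, num_hiddens))  # 2 batches, 4 queries
Y = rng.normal(size=(2, 6, num_hiddens))  # 2 batches, 6 key-value pairs
out, attn = multi_head_attention(X, Y, Y, Wq, Wk, Wv, Wo, num_heads)
print(out.shape, attn.shape)
```

The output keeps the query shape, and each head holds its own attention distribution over the key-value pairs, which is the point of splitting into heads.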

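1.2 Model Framework

The authors add an attention mechanism to an LSTM network to obtain the AT-LSTM model. Its structure is shown in the figure below; the main components are the LSTM layer and the attention layer:
[Figure: structure of the AT-LSTM model]

Attention is integrated into the LSTM mainly to handle multivariate time series. The main idea is to adaptively weight the hidden-layer elements of the network, which reduces the influence of irrelevant factors and highlights the relevant ones, thereby improving prediction accuracy. AT-LSTM combines the LSTM's ability to learn long-term dependencies with the attention mechanism's adaptive weighting of key features, with the aim of improving the accuracy and robustness of water quality prediction. The model has two parts, the LSTM layer and the attention layer. The LSTM layer captures the dynamic features of the time series. The LSTM structure was covered in detail in earlier reports, so it is not repeated here. An LSTM forward pass can be implemented as follows: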

import math
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, used by the three gates"""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, used for the candidate memory and the output"""
    return np.tanh(x)

class LSTM:
    def __init__(self, input_size, hidden_size):
        """Initialize the LSTM parameters"""
        self.input_size = input_size  # number of input features
        self.hidden_size = hidden_size  # number of hidden units

        # Initialize weights and biases
        limit = 1 / math.sqrt(input_size + hidden_size)
        self.Wi = np.random.uniform(-limit, limit, (hidden_size, input_size + hidden_size))  # input gate weights
        self.bi = np.zeros((hidden_size, 1))
        self.Wf = np.random.uniform(-limit, limit, (hidden_size, input_size + hidden_size))  # forget gate weights
        self.bf = np.zeros((hidden_size, 1))
        self.Wo = np.random.uniform(-limit, limit, (hidden_size, input_size + hidden_size))  # output gate weights
        self.bo = np.zeros((hidden_size, 1))
        self.Wc = np.random.uniform(-limit, limit, (hidden_size, input_size + hidden_size))  # candidate memory weights
        self.bc = np.zeros((hidden_size, 1))

    def forward(self, X):
        """Forward pass; X shape: (time_steps, input_size)"""
        time_steps = X.shape[0]
        h = np.zeros((self.hidden_size, 1))  # initial hidden state
        c = np.zeros((self.hidden_size, 1))  # initial memory cell
        h_history = []  # hidden states of all time steps

        for t in range(time_steps):
            # Concatenate the previous hidden state with the current input
            x_t = X[t].reshape(-1, 1)  # (input_size, 1)
            combined = np.vstack((h, x_t))  # (hidden_size + input_size, 1)

            # Input gate
            i_t = sigmoid(np.dot(self.Wi, combined) + self.bi)  # (hidden_size, 1)
            # Forget gate
            f_t = sigmoid(np.dot(self.Wf, combined) + self.bf)  # (hidden_size, 1)
            # Output gate
            o_t = sigmoid(np.dot(self.Wo, combined) + self.bo)  # (hidden_size, 1)
            # Candidate memory
            c_tilde = tanh(np.dot(self.Wc, combined) + self.bc)  # (hidden_size, 1)

            # Update the memory cell
            c = f_t * c + i_t * c_tilde
            # Update the hidden state
            h = o_t * tanh(c)
            h_history.append(h.copy())

        return np.array(h_history)  # (time_steps, hidden_size, 1)

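The attention module in the model consists of five layers. First, a fully connected layer maps the LSTM output into the attention space. Second, a softmax layer computes normalized weights over the hidden states of the time steps, producing the attention distribution. Third, a Multiply operation multiplies the attention weights element-wise with the LSTM output to highlight the key time steps. Fourth, a Flatten layer flattens the three-dimensional tensor into a one-dimensional vector for the subsequent layer. Finally, a fully connected layer maps the flattened vector to a single output.
The figure below illustrates how the attention works:
[Figure: working principle of the attention mechanism]
A simplified implementation, which sums the weighted hidden states over time instead of flattening them, is as follows: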

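import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax; axis=0 normalizes across time steps"""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

class Attention:
    def __init__(self, hidden_size):
        """Initialize the attention parameters"""
        self.hidden_size = hidden_size
        self.Wa = np.random.uniform(-0.1, 0.1, (hidden_size, hidden_size))  # attention weight matrix

    def forward(self, h_history):
        """Forward pass; h_history shape: (time_steps, hidden_size, 1)"""
        time_steps = h_history.shape[0]
        h_flat = h_history.reshape(time_steps, self.hidden_size)  # (time_steps, hidden_size)

        # Attention scores
        scores = np.dot(h_flat, self.Wa)  # (time_steps, hidden_size)
        # Normalize over the time axis so each hidden unit's weights sum to 1 across time steps
        attention_weights = softmax(scores, axis=0)  # (time_steps, hidden_size)

        # Weighted sum over time
        context = np.sum(h_flat * attention_weights, axis=0)  # (hidden_size,)
        return context, attention_weights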
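
The five-layer description (dense, softmax, multiply, flatten, dense) can also be followed literally. The sketch below is one possible reading for a single sample, not the paper's code: the function name `attention_head`, the random weights, the hidden size of 8, and the choice to apply softmax across time steps are all assumptions. The window length of 100 matches the time window reported for the model.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(H, W_att, b_att, W_out, b_out):
    """H: LSTM outputs for one sample, shape (time_steps, hidden_size)."""
    scores = H @ W_att + b_att         # 1) dense layer: (time_steps, hidden_size)
    weights = softmax(scores, axis=0)  # 2) softmax across time steps (assumed axis)
    weighted = H * weights             # 3) element-wise multiply
    flat = weighted.reshape(-1)        # 4) flatten: (time_steps * hidden_size,)
    y = flat @ W_out + b_out           # 5) dense layer to a single output
    return y, weights

rng = np.random.default_rng(0)
time_steps, hidden_size = 100, 8
H = rng.normal(size=(time_steps, hidden_size))  # stand-in for LSTM hidden states
W_att = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_att = np.zeros(hidden_size)
W_out = rng.normal(scale=0.1, size=(time_steps * hidden_size,))
b_out = 0.0
y, w = attention_head(H, W_att, b_att, W_out, b_out)
print(w.shape, float(y))
```

Unlike the weighted-sum variant, flattening keeps every time step's weighted hidden state as a separate input to the final dense layer, at the cost of a larger output weight vector.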

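The overall AT-LSTM framework is implemented as follows. It calls the LSTM, the attention mechanism, and a fully connected layer in sequence, and returns the predicted value together with the attention weights.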

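import numpy as np

class Dense:
    """Minimal fully connected layer (an assumption: the original snippet uses Dense without defining it)"""
    def __init__(self, input_size, output_size):
        limit = 1 / np.sqrt(input_size)
        self.W = np.random.uniform(-limit, limit, (output_size, input_size))
        self.b = np.zeros((output_size, 1))

    def forward(self, x):
        """x shape: (input_size, 1); returns (output_size, 1)"""
        return np.dot(self.W, x) + self.b

class AT_LSTM:
    def __init__(self, input_size, hidden_size, output_size):
        """Initialize the AT-LSTM model"""
        self.lstm = LSTM(input_size, hidden_size)
        self.attention = Attention(hidden_size)
        self.dense = Dense(hidden_size, output_size)

    def forward(self, X):
        """Forward pass; X shape: (time_steps, input_size)"""
        # LSTM layer
        h_history = self.lstm.forward(X)  # (time_steps, hidden_size, 1)

        # Attention layer
        context, weights = self.attention.forward(h_history)  # context: (hidden_size,)

        # Fully connected layer
        output = self.dense.forward(context.reshape(-1, 1))  # (output_size, 1)
        return output, weights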

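The AT-LSTM workflow is as follows:
1) The multivariate water quality data are fed into the LSTM layer.
2) The LSTM processes the sequence and produces a hidden state for each time step, capturing temporal dependencies.
3) The attention layer computes weights for the hidden states, strengthening the influence of DO-related features. For example, if a sudden pH change in a given hour strongly affects DO, its weight increases.
4) The weighted hidden states are flattened and decoded into the predicted DO value by a fully connected layer.
Compared with a conventional LSTM, AT-LSTM uses attention to focus adaptively on key time steps and avoid redundant information. The model also combines the LSTM's nonlinear modeling ability with the selective emphasis of attention, which suits the high-dimensional, nonlinear nature of water quality data.
[Figure: main parameters of the model]
The main parameters of the model are shown in the figure above. The hyperparameter settings affect the model's water quality prediction performance to some extent. The authors set the time window to 100 by trial and error and used Bayesian optimization to tune the remaining hyperparameters, arriving at relatively good hyperparameters and activation functions.
The figure below is the flowchart of the AT-LSTM prediction algorithm:
[Figure: flowchart of the AT-LSTM prediction algorithm]
The procedure has three main steps: data cleaning, then data augmentation, and finally prediction.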

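1.3 Experimental Analysis

(1) Dataset
The data are historical water quality records from an automatic monitoring station on the Burnett River, covering January 2015 to January 2020 at half-hourly intervals, 39,752 records in total.
The features are water temperature (Temp), electrical conductivity (EC), pH, dissolved oxygen (DO), turbidity (NTU), and chlorophyll-a (Chl-a). DO is the output variable and the others are input features.

After cleaning, the data are split into training, validation, and test sets in an 8:1:1 ratio.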
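
To make the data preparation concrete, here is a minimal sketch of the sliding-window conversion followed by an 8:1:1 split. The helper names `make_windows` and `chronological_split`, the synthetic data, and the choice to split in time order without shuffling are assumptions for illustration; the paper's exact preprocessing may differ. The window length of 100 is the value reported for the model.

```python
import numpy as np

def make_windows(features, target, window, horizon=1):
    """Turn a multivariate series into supervised samples.

    X[i] holds `window` consecutive rows of `features`;
    y[i] is the target value `horizon` steps after that window ends.
    """
    n = len(features) - window - horizon + 1
    X = np.stack([features[i:i + window] for i in range(n)])
    y = np.array([target[i + window + horizon - 1] for i in range(n)])
    return X, y

def chronological_split(X, y, ratios=(0.8, 0.1, 0.1)):
    """Split into train/validation/test in time order (no shuffling)."""
    n = len(X)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return ((X[:n_train], y[:n_train]),
            (X[n_train:n_train + n_val], y[n_train:n_train + n_val]),
            (X[n_train + n_val:], y[n_train + n_val:]))

# Synthetic stand-in for the five input features and the DO series
rng = np.random.default_rng(0)
n_steps, n_features = 1000, 5
features = rng.normal(size=(n_steps, n_features))
do = rng.normal(size=n_steps)

X, y = make_windows(features, do, window=100, horizon=1)
train, val, test = chronological_split(X, y)
print(X.shape, y.shape, len(train[0]), len(val[0]), len(test[0]))
```

Setting `horizon` to a larger value gives targets further ahead, which is one way to set up the multi-step experiments. Splitting in time order keeps future observations out of the training set.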
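(2) Evaluation metrics
RMSE: the root mean squared error between predicted and actual values; it penalizes large errors more heavily. Smaller is better.
MAE: the mean absolute error between predicted and actual values. Smaller is better.

R²: the coefficient of determination, which measures goodness of fit. The closer to 1, the better.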
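
The three metrics can be computed directly from their standard definitions. The sketch below uses a small made-up example; the function names and the sample values are illustrative, not results from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up DO values (mg/L) for illustration only
y_true = np.array([8.0, 7.5, 7.0, 6.5])
y_pred = np.array([8.0, 7.0, 7.0, 7.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Here two of the four predictions are off by 0.5, giving MAE = 0.25, RMSE = sqrt(0.125) ≈ 0.354, and R² = 1 - 0.5/1.25 = 0.6.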

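(3) Experimental results
The experiments compare AT-LSTM with LSTM in three respects: single-step prediction, multi-step prediction, and model validation.
1) Single-step prediction
In the figure below, panel (a) is a scatter plot of the DO values predicted by the LSTM and AT-LSTM models against the DO values measured during the test period. The blue points are (measured, predicted) pairs, and the black dashed line marks a perfect match, where measured DO equals predicted DO. Panel (b) compares the predicted and actual DO values of both models over the test period.

[Figure: (a) predicted vs. measured DO scatter plots; (b) predicted and actual DO over the test period, for LSTM and AT-LSTM]
The figure shows that the AT-LSTM predictions lie closer to the actual values and to the ideal line, indicating smaller prediction errors. The AT-LSTM curve follows the actual values closely, whereas the LSTM deviates more at peaks and troughs, which suggests it adapts poorly to abrupt changes. The explanation is that the attention mechanism in AT-LSTM weights the hidden-layer features, filtering noise in the series (such as irrelevant fluctuations during stable periods) and strengthening the modeling of key moments (such as sudden drops in DO).
2) Multi-step prediction
The figure below shows line plots, residual histograms, and error plots for 48-hour-ahead predictions on the test set using LSTM and AT-LSTM.

[Figure: line plots, residual histograms, and error plots of 48-hour-ahead predictions for LSTM and AT-LSTM]

The figure shows that the errors of both models rise as the prediction horizon increases, but the error of AT-LSTM grows more slowly. The main reason is that the attention mechanism keeps attending to key historical features in long sequences (such as pH changes before a drop in DO), which mitigates the LSTM's tendency to lose information as the number of time steps grows.
3) Model validation
The figure below shows the RMSE of the AT-LSTM and LSTM models for 1-12 h ahead predictions on a new dataset.

[Figure: RMSE of AT-LSTM and LSTM for 1-12 h ahead predictions on the new dataset]
The figure shows that the RMSE curve of AT-LSTM is flatter and lies below that of LSTM, which supports its advantage on a different dataset. This is attributed mainly to the attention mechanism adaptively adjusting feature weights, which helps the model adapt to a new environment, whereas the LSTM lacks this mechanism and generalizes less well.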

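Summary

So far I have studied three models that combine attention with an LSTM, summarized here. In ALGAN, attention acts directly on the LSTM output and adjusts the hidden states to reduce information loss. In DA-LSTM, attention is split into two stages that handle the input features and the time steps respectively, helping the model capture key information in long time series. In AT-LSTM, attention is applied after LSTM feature extraction to increase the weight of features at key time steps, which improves prediction accuracy.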