Survey of Uncertainty in Neural Networks (Part I): A survey of uncertainty in deep neural networks

0. Overview

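As neural networks are deployed ever more widely in the real world, the confidence of their predictions is becoming increasingly important, especially in high-risk domains such as medical image analysis and autonomous driving. However, basic neural networks do not include any confidence estimation and often suffer from over-confidence or under-confidence. To address this, researchers have turned to quantifying the uncertainty in neural network predictions, which has led to definitions of different types and sources of uncertainty as well as techniques for quantifying it.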

This article attempts to summarize and organize uncertainty estimation methods for neural networks. It divides the sources of uncertainty into two categories, reducible model uncertainty and irreducible data uncertainty, and introduces uncertainty modeling approaches based on deterministic neural networks, Bayesian neural networks, ensembles of neural networks, and test-time data augmentation.

1. Introduction

Deep neural networks (DNNs) remain limited in mission- and safety-critical real-world applications. Specifically:

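The expressiveness and transparency of a DNN's inference model are insufficient, which makes its predictions hard to trust (poor interpretability);
DNNs cannot distinguish in-domain from out-of-domain samples and are highly sensitive to domain shifts (poor generalization);
DNNs cannot provide reliable uncertainty estimates and tend to produce over-confident predictions (over-confidence);
DNNs are sensitive to adversarial attacks, which makes them easy to attack and sabotage (poor system robustness).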

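These phenomena have two main causes. One is the uncertainty introduced by the data itself, i.e. data uncertainty; the other is the uncertainty caused by insufficient knowledge learned by the network, i.e. model uncertainty. Uncertainty estimation is therefore essential for overcoming these limitations of DNNs. With uncertainty estimates, human experts can inspect the uncertainty attached to each prediction and disregard predictions whose uncertainty is high.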

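Uncertainty estimation not only supports safe decision-making in high-risk domains; it is also crucial in domains where data sources are highly inhomogeneous or labeled data is scarce, and in fields where uncertainty is a key part of the learning technique, such as active learning and reinforcement learning.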

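In recent years, researchers have shown growing interest in uncertainty estimation for DNNs. The most common way to estimate predictive uncertainty is to model separately the uncertainty caused by the model (epistemic or model uncertainty) and the uncertainty caused by the data (aleatoric or data uncertainty). Model uncertainty is reducible, whereas data uncertainty is not.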

Currently, the most common approaches to modeling these two types of uncertainty are:

Bayesian inference
Ensemble approaches
Test-time augmentation
Single deterministic networks containing explicit components to represent the model and the data uncertainty.

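However, in high-risk domains, estimating the predictive uncertainty alone is far from sufficient; one must also verify that the estimated uncertainty is reliable. To this end, researchers have studied the calibration of DNNs (the degree of reliability) and proposed re-calibration methods to obtain reliable (well-calibrated) uncertainty estimates.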

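The following sections cover the sources and types of uncertainty, methods for estimating uncertainty in DNNs, measures for assessing the quality of uncertainty estimates, methods for calibrating DNNs, frequently used evaluation datasets and benchmarks, real-world applications of uncertainty estimation, and the open challenges and future outlook of the field.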

2. Uncertainty in deep neural networks

A neural network is a nonlinear function $f_\theta$ parameterized by model parameters $\theta$ that maps a measurable input set $\mathbb{X}$ to a measurable output set $\mathbb{Y}$ , i.e. $f_\theta:\mathbb{X}\rightarrow\mathbb{Y},f_\theta(x)=y$ .

In the supervised setting, we further have a finite training set $\mathcal{D}\subseteq\mathbb{D}=\mathbb{X}\times\mathbb{Y}$ containing $N$ image-label pairs, i.e.

$\mathcal{D}=(\mathcal{X},\mathcal{Y})=\{{x_n,y_n}\}_{n=1}^N\subseteq\mathbb{D}.$

For a new data sample $x^\ast\in\mathbb{X}$ , a neural network trained on $\mathcal{D}$ can then predict the corresponding target,

$f_\theta(x^\ast)=y^\ast.$

We now consider the four distinct steps that lead from raw information in the environment to a neural network prediction with quantified uncertainty:

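the data acquisition process: the occurrence of some information in the environment (e.g. a bird's singing) and a measured observation of this information (e.g. an audio record)
the DNN building process: the design and training of the network
the applied inference model: the model used at inference time
the prediction's uncertainty model: the modeling of the uncertainties caused by the neural network and/or by the data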

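These four steps contain several potential sources of uncertainty and error, which in turn affect the final prediction of the neural network. The authors consider the following five factors to be the most important causes of uncertainty in a neural network's predictions: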

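the variability in real-world situations
the errors inherent to measurement systems
the errors in the architecture specification of the DNN, e.g. the number of nodes per layer or the activation functions
the errors in the training procedure of the DNN
the errors caused by unknown data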

Next, we describe in detail how these four steps and five error sources give rise to uncertainty.

2.1 Data acquisition

For supervised learning, data acquisition describes the process that generates the measurements $x$ and target variables $y$ which represent a real-world situation $\omega$ from some space $\Omega$ .

In the real world, a realization of $\omega$ could for example be a bird, $x$ a picture of this bird, and $y$ a label stating ‘bird’. During the measurement, random noise can occur and information may get lost. We model this randomness in $x$ by

$x|\omega\sim p_{x|\omega}$

Equivalently, the corresponding target variable $y$ is derived, where the description is either based on another measurement or is the result of a labeling process*. For both cases, the description can be affected by noise and errors and we state it as

$y|\omega\sim p_{y|\omega}$

*NOTE: In many cases one can model the labeling process as a mapping from $\mathbb{X}$ to $\mathbb{Y}$ , e.g. for speech recognition or various computer vision tasks. For other tasks, such as earth observation, this is not always the case. Data is often labeled based on high-resolution data while low-resolution data is utilized for the prediction task.

A neural network is trained on a finite dataset of realizations of $x|\omega_i$ and $y|\omega_i$ based on $N$ real world situations $\omega_1,\ldots,\omega_N$ ,

$\mathcal{D}=\{x_i,y_i\}_{i=1}^N.$
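
The acquisition model above can be simulated in a few lines. The following is a minimal sketch and not part of the survey: the function `acquire`, the Gaussian sensor noise, the label-flip probability, and the toy situations are all hypothetical choices, used only to illustrate drawing $x|\omega$ and $y|\omega$ for $N$ situations.

```python
import random

def acquire(omega, rng, sensor_noise=0.1, label_flip_prob=0.05):
    """Draw one (x, y) pair for a real-world situation omega.

    omega is a hypothetical dict holding a true signal and a true binary label.
    """
    # x | omega ~ p_{x|omega}: true signal corrupted by Gaussian sensor noise (assumed)
    x = omega["signal"] + rng.gauss(0.0, sensor_noise)
    # y | omega ~ p_{y|omega}: true label, flipped with a small probability (assumed)
    y = omega["label"]
    if rng.random() < label_flip_prob:
        y = 1 - y
    return x, y

rng = random.Random(0)
# N = 1000 toy situations omega_1, ..., omega_N
situations = [{"signal": float(i % 2), "label": i % 2} for i in range(1000)]
dataset = [acquire(w, rng) for w in situations]

# Count how many targets were corrupted by label noise
flipped = sum(1 for w, (_, y) in zip(situations, dataset) if y != w["label"])
print(len(dataset), flipped)
```

In this toy setup the network would only ever see `dataset`, never `situations`, so both the sensor noise and the flipped labels are baked into what it learns from.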

When we collect data and use it to train a neural network, two factors introduce uncertainty into the network.

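The first factor stems from the complexity and variability of real-world situations. For example, a plant looks different after rain than after a drought. During data collection we can only cover as many situations as possible and can never enumerate them all, which causes neural networks to perform poorly under distribution shifts.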

Factor I: Variability in real-world situations

Most real-world environments are highly variable and almost constantly affected by changes. These changes affect parameters such as temperature, illumination, clutter, and physical objects’ size and shape. Changes in the environment can also affect the expression of objects; for example, plants after rain look very different from plants after a drought. When real-world situations change compared to the training set, this is called a distribution shift. Neural networks are sensitive to distribution shifts, which can lead to significant changes in the performance of a neural network.

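The second factor is the measurement system itself, which directly affects the correlation between a sample and its corresponding target. The measurement system generates information $x_i$ and $y_i$ that describe $\omega_i$ . This means that highly different real-world situations can end up with similar measurements or targets. For example, for the two situations city and forest, the temperature recorded by the measurement system may well be similar (similar $x$ ); likewise, label noise can make their targets similar, e.g. when both are labeled as forest (similar $y$ ).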

Factor II: Error and noise in measurement systems

The measurements themselves can be a source of uncertainty on the neural network’s prediction. This can be caused by limited information in the measurements, such as the image resolution. Moreover, it can be caused by noise, for example, sensor noise, by motion, or mechanical stress leading to imprecise measures. Furthermore, false labeling is also a source of uncertainty that can be seen as an error or noise in the measurement system. It is referenced as label noise and affects the model by reducing the confidence on the true class prediction during training. Depending on the intensity, this type of noise and errors can be used to regularize the training process and to improve robustness and generalization.

2.2 Deep neural network design and training

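Designing a neural network involves explicit modeling as well as a stochastic training process. The assumptions about the problem structure that are induced by the design and training of the network are called inductive bias. Concretely, inductive bias can be understood as the strategy a modeler adopts at the modeling stage based on prior knowledge. For the network structure, this covers the number of parameters, the number of layers, the choice of activation functions, and so on; for the training process, it covers the optimization algorithm, regularization, data augmentation, and so on. These choices directly affect the final performance of the model, and the network structure gives rise to the third factor of uncertainty in a neural network's predictions.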

Factor III: Errors in the model structure

The structure of a neural network has a direct effect on its performance and therefore also on the uncertainty of its prediction. For instance, the number of parameters affects the memorization capacity, which can lead to under- or over-fitting on the training data. Regarding uncertainty in neural networks, it is known that deeper networks tend to be overconfident in their soft-max output, meaning that they predict too much probability on the class with the highest probability score.
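
To make "too much probability on the class with the highest probability score" concrete, the sketch below applies a soft-max to a set of hypothetical logits and to the same logits scaled by a factor of 3. The predicted class does not change, but the top probability grows. This only illustrates how soft-max confidence depends on the magnitude of the logits; it does not demonstrate that depth causes overconfidence.

```python
import math

def softmax(logits):
    """Numerically stable soft-max over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # hypothetical network outputs
p_small = softmax(logits)
p_large = softmax([3.0 * z for z in logits])  # same ranking, larger magnitude

print(max(p_small), max(p_large))
```

Because the ranking of the classes is identical in both cases, accuracy is unaffected; only the reported confidence differs, which is why confidence has to be assessed separately from accuracy.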

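For a given network structure $s$ and training dataset $\mathcal{D}$ , training a neural network is a stochastic process, so the resulting network $f_\theta$ is based on a random variable $\theta|D,s\sim p_{\theta|D,s}$ . Training involves many sources of randomness, such as the order of the data, random initialization, and stochastic regularization such as augmentation or dropout. Because the loss of a neural network is highly nonlinear in its parameters, this randomness leads to different local optima $\theta^\ast$ and therefore to different models. In addition, hyperparameters such as the batch size, the learning rate, and the number of epochs also affect the outcome of training and yield different models. This sensitivity to the training process gives rise to the fourth factor of uncertainty.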

Factor IV: Errors in the training procedure

The training process of a neural network includes many parameters that have to be defined (batch size, optimizer, learning rate, stopping criteria, regularization, etc.), and also stochastic decisions within the training process (batch generation and weight initialization) take place. All these decisions affect the local optima and it is therefore very unlikely that two training processes deliver the same model parameterization. A training dataset that suffers from imbalance or low coverage of single regions in the data distribution also introduces uncertainties on the network’s learned parameters, as already described in the data acquisition. This might be softened by applying augmentation to increase the variety or by balancing the impact of single classes or regions on the loss function.

Because training is based on the given training dataset $\mathcal{D}$ , errors in the data acquisition process (such as label noise) also lead to errors in the training process.
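
The dependence of the learned parameters on stochastic training decisions can be seen even in a tiny example. The sketch below is an illustration under simplifying assumptions, not the survey's setup: it fits a hypothetical linear model $y=wx+b$ with plain SGD, and a single seed controls both the random initialization and the data order. The loss here is convex, so there are no distinct local optima; the point is only that a finite, randomized training run already yields different $\theta$ for different seeds on the same dataset.

```python
import random

def train(seed, data, epochs=20, lr=0.05):
    """Fit y = w * x + b with plain SGD; the seed controls all randomness."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)  # random initialization
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                # random data order
        for x, y in samples:
            err = (w * x + b) - y           # gradient of 0.5 * err**2 w.r.t. the prediction
            w -= lr * err * x
            b -= lr * err
    return w, b

# Fixed toy dataset D: noisy samples of y = 2x + 1 (hypothetical)
data_rng = random.Random(42)
data = [(i / 10, 2.0 * (i / 10) + 1.0 + data_rng.gauss(0.0, 0.3)) for i in range(-10, 11)]

theta_a = train(0, data)
theta_b = train(1, data)
theta_a_again = train(0, data)
print(theta_a, theta_b)
```

Repeating the run with the same seed reproduces the same parameters, while a different seed gives a nearby but different parameterization; each run can be read as one draw from $p_{\theta|D,s}$ .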

2.3 Inference

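Inference describes the prediction of an output $y^\ast$ for a new data sample $x^\ast$ by the neural network. The network has been trained for a specific task, so samples that do not fit the inputs of this task cause errors and are therefore another source of uncertainty.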

Factor V: Errors caused by unknown data

Especially in classification tasks, a neural network that is trained on samples derived from a world $\mathcal{W}_1$ can also be capable of processing samples derived from a completely different world $\mathcal{W}_2$ . This is for example the case when a network trained on images of cats and dogs receives a sample showing a bird. Here, the source of uncertainty does not lie in the data acquisition process, since we assume a world to contain only feasible inputs for a prediction task. Even though the practical result might be equal to too much noise on a sensor or complete failure of a sensor, the data considered here represents a valid sample, but for a different task or domain.

This kind of error is not caused by the data acquisition process but by unknown data.

2.4 Predictive uncertainty model

The uncertainty contained in a neural network's predictions can be divided into three types:

data uncertainty [also statistical or aleatoric uncertainty]
model uncertainty [also systemic or epistemic uncertainty]
distributional uncertainty (caused by examples from a region not covered by the training data)

2.4.1 Model and data uncertainty

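Model uncertainty stems from shortcomings of the model. These may be flaws in the model structure itself, errors introduced during training, or a lack of knowledge caused by unknown data or by a training set that covers the real world poorly (bad coverage).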

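Data uncertainty stems from shortcomings of the data itself. Its root cause is information loss: the data cannot represent the real world perfectly, so a sample does not carry enough information to identify a class with 100% certainty.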

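Information loss shows up in two places. One is the source (data); for example, a low-resolution image loses information about the real world. The other is the target (label); for example, errors made during the labeling process.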

Let us now revisit the five factors that cause uncertain predictions:

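Factor I: the variability in real-world situations
Factor II: the shortcomings of the measurement system
Factor III: the shortcomings of the model structure
Factor IV: the errors in the stochastic training procedure
Factor V: the interference of unknown data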

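Only Factor II is a source of irreducible aleatoric uncertainty, because it causes a deficiency in the data itself (insufficient data) that makes predictions unreliable. All the other factors are sources of epistemic uncertainty and can be reduced.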

In theory, model uncertainty can be reduced by improving the model architecture, the learning process, or the training dataset, whereas data uncertainty cannot be eliminated. For real-world applications it is therefore crucial that a model can remove or quantify the model uncertainty and give a correct prediction of the data uncertainty. (A notion of supervision is implied here.)

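Among the many available approaches, the Bayesian framework offers a practical tool for reasoning about uncertainty in deep learning. In the Bayesian framework, model uncertainty is formalized as a probability distribution over the model parameters, while data uncertainty is formalized as a probability distribution over the output $y^\ast$ of a model with parameters $\theta$ .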

The probability distribution of a prediction $y^\ast$ is

$p(y^*|x^*,D)=\int\underbrace{p(y^*|x^*,\theta)}_{\mathrm{Data}}\underbrace{p(\theta|D)}_{\mathrm{Model}}d\theta.$

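Here $p(\theta|D)$ is the posterior over the model parameters and describes the uncertainty about the parameters given the training set $D$ . The posterior is generally intractable. To obtain an (approximate) posterior, ensemble approaches learn several different parameter settings and average their results, whereas Bayesian inference uses Bayes' theorem to reformulate it as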

$p(\theta|D)=\frac{p(D|\theta)p(\theta)}{p(D)}$

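Here $p(\theta)$ is called the prior over the model parameters because it takes no information into account other than $\theta$ itself, and $p(D|\theta)$ , called the likelihood, expresses that $D$ is a realization of the output distribution predicted by the model with parameters $\theta$ . Many loss functions are motivated by the likelihood; cross-entropy and mean squared error, for example, are losses whose minimization maximizes a log-likelihood.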
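
The link between likelihood and loss can be written out explicitly. The following is a sketch under assumptions that the text does not state: the inputs $x_n$ are treated as fixed, the targets are conditionally independent given $\theta$ , and the likelihood is either Gaussian with a fixed variance $\sigma^2$ around a scalar output $f_\theta(x_n)$ (regression) or categorical with class probabilities $f_\theta(x_n)_k$ over $K$ classes (classification). Under independence, the negative log-likelihood is a sum over samples:

$-\log p(D|\theta)=-\sum_{n=1}^N\log p(y_n|x_n,\theta).$

With the Gaussian assumption this is the squared error up to a scale and an additive constant that do not depend on $\theta$ :

$-\log p(D|\theta)=\frac{1}{2\sigma^2}\sum_{n=1}^N\left(y_n-f_\theta(x_n)\right)^2+\frac{N}{2}\log\left(2\pi\sigma^2\right).$

With the categorical assumption it is the cross-entropy between the one-hot labels and the predicted class probabilities:

$-\log p(D|\theta)=-\sum_{n=1}^N\sum_{k=1}^K\mathbb{1}[y_n=k]\log f_\theta(x_n)_k.$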

However, even after reformulating the posterior with Bayes' theorem, $p(y^\ast|x^\ast,D)$ remains intractable. A range of methods has been proposed to address this problem; they are presented in Section 3.
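
One simple way to see how such approximations work is to replace the integral over $\theta$ by an average over $M$ parameter samples $\theta_m$ , for example the members of an ensemble: $p(y^*|x^*,D)\approx\frac{1}{M}\sum_{m=1}^M p(y^*|x^*,\theta_m)$ . The sketch below assumes hypothetical soft-max outputs for $M=3$ models. It also splits the entropy of the averaged prediction into an expected-entropy part and a remainder (the mutual information between $y^\ast$ and $\theta$ ); this is a commonly used decomposition that the text above does not introduce, included here only to show how a data part and a model part can be read off numerically.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def predictive_distribution(member_probs):
    """Average p(y*|x*, theta_m) over M parameter samples theta_m."""
    m = len(member_probs)
    k = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / m for c in range(k)]

# Hypothetical soft-max outputs of M = 3 models for one input x*
member_probs = [[0.7, 0.2, 0.1],
                [0.6, 0.3, 0.1],
                [0.2, 0.7, 0.1]]

p_mean = predictive_distribution(member_probs)
total = entropy(p_mean)                                                # entropy of the averaged prediction
data_part = sum(entropy(p) for p in member_probs) / len(member_probs)  # expected entropy of the members
model_part = total - data_part                                         # mutual information (disagreement)

print(p_mean, total, data_part, model_part)
```

If all members returned the same distribution, `model_part` would be zero and the whole entropy would be attributed to the data; the more the members disagree, the larger the model part becomes.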

2.4.2 Distributional uncertainty

The predictive uncertainty can be further separated into data, model, and distributional uncertainty:

$p(y^*|x^*,D)=\int\int\underbrace{p(y^*|\mu)}_{\mathrm{Data}}\underbrace{p(\mu|x^*,\theta)}_{\text{Distributional}}\underbrace{p(\theta|D)}_{\mathrm{Model}}d\mu d\theta.$

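How should the distributional part be understood? Uncertainty can be described by a distribution; model uncertainty, for example, is captured by a distribution over the model parameters $\theta$ . In the same way, distributional uncertainty is captured by a distribution over distributions. For a classification task, $p\left(\mu\middle| x^\ast,\theta\right)$ might be a Dirichlet distribution, i.e. the distribution followed by the categorical distribution that the soft-max finally outputs. Data uncertainty, in turn, is the distribution of the final prediction $y^\ast$ parameterized by $\mu$ .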

Under this model, distributional uncertainty is the uncertainty introduced by a change in the input-data distribution, and model uncertainty is the uncertainty introduced in the model building and training process. Model uncertainty affects the estimate of distributional uncertainty, which in turn affects the estimate of data uncertainty.
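
The idea of a distribution over distributions can be made tangible by sampling from a Dirichlet. The sketch below is an illustration under assumptions, not the survey's method: the two concentration vectors are hypothetical stand-ins for $p(\mu|x^\ast,\theta)$ on two different inputs, and the Dirichlet draw is implemented with normalized Gamma samples from the Python standard library. A flat Dirichlet spreads its mass over many different categorical distributions $\mu$ (high distributional uncertainty), while a sharp Dirichlet centered on the uniform distribution is confident about $\mu$ even though $\mu$ itself is ambiguous (high data uncertainty).

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one categorical distribution mu ~ Dirichlet(alpha) via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def variance_of_first_class(alpha, rng, n=5000):
    """Empirical variance of mu_0 over n draws of mu."""
    draws = [sample_dirichlet(alpha, rng)[0] for _ in range(n)]
    mean = sum(draws) / n
    return sum((d - mean) ** 2 for d in draws) / n

rng = random.Random(0)
flat = [1.0, 1.0, 1.0]      # hypothetical p(mu | x*, theta) for an unfamiliar input
sharp = [50.0, 50.0, 50.0]  # hypothetical p(mu | x*, theta) for a familiar but ambiguous input

mu = sample_dirichlet(flat, rng)
var_flat = variance_of_first_class(flat, rng)
var_sharp = variance_of_first_class(sharp, rng)
print(mu, var_flat, var_sharp)
```

Both Dirichlets have the same mean (the uniform categorical distribution), so the averaged prediction alone cannot tell the two cases apart; the spread of $\mu$ is what carries the distributional part.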

2.5 Uncertainty classification

On the basis of the input data domain, the predictive uncertainty can also be classified into 3 main classes:

In-domain uncertainty

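In-domain uncertainty represents the uncertainty related to an input drawn from a data distribution assumed to be equal to the training data distribution. It stems from the inability of the deep neural network to explain an in-domain sample due to a lack of in-domain knowledge. From the modeler's point of view, it is caused on the one hand by design errors (model uncertainty) and on the other hand by the complexity of the problem at hand (data uncertainty). Depending on which of the two is the source, in-domain uncertainty can be reduced by improving the quality of the training data or by improving the training process.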

Domain-shift uncertainty

Domain-shift uncertainty denotes the uncertainty related to an input drawn from a shifted version of the training distribution. It stems from insufficient coverage by the training dataset, which may be due to improper data collection or to the variability of the real world. From the modeler's point of view, domain-shift uncertainty is caused by external or environmental factors. It can be reduced by covering the shifted domain in the training dataset.

Out-of-domain uncertainty

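Out-of-domain uncertainty represents the uncertainty related to an input drawn from the subspace of unknown data. The distribution of unknown data is far from the distribution of the training dataset, as when an image of a bird is fed to a cat-vs-dog classifier. From the modeler's point of view, out-of-domain uncertainty is caused by the input sample, i.e. by insufficient training data.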

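All three kinds of uncertainty are thus rooted in the insufficiency of the training dataset. Returning to model and data uncertainty: the main cause of model uncertainty is likewise an insufficient training dataset, so in-domain, domain-shift, and out-of-domain uncertainty can all give rise to model uncertainty. Data uncertainty, by contrast, is mostly related to in-domain uncertainty, e.g. overlapping samples and systematic label noise, both of which originate from ambiguity or noise in the in-domain data.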