Related links:
神经网络不确定性综述(Part I)——A survey of uncertainty in deep neural networks-CSDN博客
神经网络不确定性综述(Part II)——Uncertainty estimation_Single deterministic methods-CSDN博客
神经网络不确定性综述(Part III)——Uncertainty estimation_Bayesian neural networks-CSDN博客
神经网络不确定性综述(Part IV)——Uncertainty estimation_Ensemble methods&Test-time augmentation-CSDN博客
神经网络不确定性综述(Part V)——Uncertainty measures and quality-CSDN博客
3.2 Bayesian neural networks
Bayesian neural networks combine neural networks with Bayesian learning by inferring a probability distribution over the network parameters $\theta$. Concretely, given the input-target pairs $D = \{(x_i, y_i)\}_{i=1}^{n}$, one assumes a prior distribution $p(\theta)$ over the parameters and uses Bayes' theorem to model the posterior distribution over the parameter space:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},$$

where $p(D) = \int p(D \mid \theta')\, p(\theta')\, d\theta'$ is the model evidence.
Once the posterior distribution over the weight parameters has been estimated, the prediction for a new input $x^*$ can be obtained by Bayesian model averaging, or full Bayesian analysis, which marginalizes the likelihood over the posterior distribution $p(\theta \mid D)$:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid D)\, d\theta.$$
This integral is intractable, so it is usually approximated. The most widely used approximation is the Monte Carlo approximation, which, following the law of large numbers, approximates the expectation by the mean of the predictions of $N$ stochastically sampled networks:

$$p(y^* \mid x^*, D) \approx \frac{1}{N} \sum_{i=1}^{N} p(y^* \mid x^*, \theta_i), \qquad \theta_i \sim p(\theta \mid D).$$
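As a concrete illustration, here is a minimal PyTorch sketch of this Monte Carlo averaging over $N$ stochastic forward passes; MC dropout stands in for drawing $\theta_i$ from the posterior, and the network, dropout rate, and sample count are illustrative assumptions, not the survey's prescription:

```python
# Monte Carlo averaging of N stochastic forward passes.
# MC dropout stands in for "sample theta_i from the posterior"; any other
# sampler (VI, SG-MCMC, Laplace) could supply the weight samples instead.
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    def __init__(self, in_dim=10, hidden=64, out_dim=3, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_predict(model, x, n_samples=50):
    model.train()  # keep dropout active so each pass uses a different mask
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    )
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean and spread

model = MCDropoutNet()
mean, std = mc_predict(model, torch.randn(5, 10))
```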
Wilson and Izmailov (2020) argue that a key advantage of BNNs lies precisely in this marginalization, which can improve both the accuracy and the calibration of deep neural networks. Moreover, BNNs are not limited to uncertainty estimation; they bring the broader Bayesian toolbox to deep learning, e.g. Bayesian model selection, model compression, active learning, and theoretical advances. Although the formulation is simple, BNNs still face challenges. The posterior inference generally has no closed-form solution, since complex models such as neural networks typically admit no conjugate prior (Bishop and Nasrabadi, 2006), so approximate Bayesian inference techniques are needed to compute the posterior. However, applying approximate Bayesian inference directly has proven difficult, because DNNs involve very large amounts of data and parameters, i.e., the integral above does not scale well as data size and parameter counts grow. In addition, specifying a meaningful prior for a DNN is another challenge.
Based on how the posterior distribution is inferred to approximate Bayesian inference, the authors divide BNNs into three categories:
- Variational inference
Variational inference approaches approximate the (in general intractable) posterior distribution by optimizing over a family of tractable distributions.
- Sampling approaches
Sampling approaches deliver a representation of the target random variable from which realizations can be sampled. Such methods are based on Markov Chain Monte Carlo and further extensions.
- Laplace approximation
Laplace approximation simplifies the target distribution by approximating the log-posterior distribution and then, based on this approximation, deriving a normal distribution over the network weights.
3.2.1 Variational inference
The goal of variational inference is to infer the posterior $p(\theta \mid D)$ using a prespecified family of distributions $q(\theta)$. This so-called variational family is defined as a parametric distribution; for example, a multivariate normal distribution is parameterized by its mean and covariance matrix. The main idea of variational inference is to find the parameters of $q(\theta)$ that make it close to the posterior of interest, where the closeness between the two distributions is measured by the Kullback-Leibler (KL) divergence:

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\!\left[\log \frac{q(\theta)}{p(\theta \mid D)}\right].$$

Since the KL divergence contains the intractable posterior, it cannot be optimized directly; in practice one optimizes the evidence lower bound (ELBO) instead:

$$\mathcal{L} = \mathbb{E}_{q}\!\left[\log p(D \mid \theta)\right] - \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta)\right).$$

The KL divergence can then be rewritten as

$$\mathrm{KL}(q \,\|\, p) = \log p(D) - \mathcal{L},$$

and since the evidence $\log p(D)$ does not depend on $q$, minimizing the KL divergence is exactly equivalent to maximizing the ELBO.
Other keywords (left for the reader to look up): the reparameterization trick; mean-field approximations; Monte Carlo Dropout.
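To make the ELBO concrete, here is a minimal sketch of mean-field variational inference for a single Bayesian linear layer, trained by maximizing the ELBO via the reparameterization trick; the layer sizes, prior, data, and training loop are illustrative assumptions:

```python
# Mean-field VI for one Bayesian linear layer (Bayes-by-Backprop style).
# Loss = negative ELBO = NLL + KL(q(w) || p(w)) with a standard normal prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Variational parameters: q(w) = N(mu, sigma^2), one Gaussian per weight.
        self.mu = nn.Parameter(torch.zeros(out_dim, in_dim))
        self.rho = nn.Parameter(torch.full((out_dim, in_dim), -3.0))

    def forward(self, x):
        sigma = F.softplus(self.rho)
        # Reparameterization: w = mu + sigma * eps keeps sampling differentiable.
        w = self.mu + sigma * torch.randn_like(sigma)
        # Closed-form KL(q || p) against the prior p(w) = N(0, 1), summed over weights.
        self.kl = (torch.log(1.0 / sigma) + (sigma**2 + self.mu**2) / 2 - 0.5).sum()
        return F.linear(x, w)

layer = BayesianLinear(10, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(128, 10), torch.randn(128, 1)
for _ in range(100):
    opt.zero_grad()
    nll = F.mse_loss(layer(x), y, reduction="sum")  # -log-likelihood up to a constant
    loss = nll + layer.kl  # negative ELBO
    loss.backward()
    opt.step()
```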
3.2.2 Sampling methods
Sampling methods, often referred to as Monte Carlo methods, are another family of Bayesian inference algorithms. They represent the posterior by a set of samples drawn from it, without restricting the form of the distribution, hence probability distributions are obtained non-parametrically. Popular algorithms include particle filtering, rejection sampling, importance sampling, and Markov chain Monte Carlo (MCMC) sampling. For neural networks, MCMC is the commonly used approach, since methods based on rejection sampling and importance sampling are very inefficient for high-dimensional problems. The main idea of MCMC is to sample from arbitrary distributions by transitions in state space, where each transition is governed by a record of the current state and a proposal distribution that aims to estimate the target distribution (e.g. the true posterior). To explain this further, we define:
A Markov chain is a distribution over random variables $x_1, \ldots, x_T$ which follows the state transition rule

$$p(x_{t+1} \mid x_1, \ldots, x_t) = p(x_{t+1} \mid x_t),$$

i.e. the next state depends only on the current state and not on any former state.
At each sampling step, the current proposal is accepted or rejected according to a fixed rule; as this process continues, the samples drawn after some point in time are guaranteed to come approximately from the target distribution.
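Here is a minimal NumPy sketch of random-walk Metropolis-Hastings, the accept/reject scheme just described; the toy target density and step size are illustrative assumptions, and for real BNN posteriors one would use the gradient-based variants listed below instead:

```python
# Random-walk Metropolis-Hastings targeting an unnormalized log-density.
import numpy as np

def log_target(theta):
    # Stand-in for an intractable log posterior log p(theta | D) + const
    # (here: a standard normal, so the chain's statistics are easy to check).
    return -0.5 * np.sum(theta**2)

def metropolis_hastings(log_p, theta0, n_steps=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, logp = theta0, log_p(theta0)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.normal(size=theta.shape)  # symmetric proposal
        logp_new = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < logp_new - logp:
            theta, logp = proposal, logp_new
        samples.append(theta)
    return np.array(samples)

chain = metropolis_hastings(log_target, theta0=np.zeros(2))
print(chain[1000:].mean(axis=0), chain[1000:].std(axis=0))  # ~[0, 0], ~[1, 1]
```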
Other keywords: Hamiltonian Monte Carlo / Hybrid Monte Carlo; Stochastic Gradient Markov Chain Monte Carlo; Langevin dynamics.
For the remaining details, please refer to the original paper.
3.2.3 Laplace approximation
The goal of the Laplace Approximation is to estimate the posterior distribution over the parameters of neural networks around a local mode of the loss surface with a Multivariate Normal distribution.
The Laplace Approximation to the posterior can be obtained by taking the second-order Taylor series expansion of the log posterior over the weights around the MAP estimate $\hat{\theta}$ given some data $D$. If we assume a Gaussian prior with a scalar precision value $\tau > 0$, then this corresponds to the commonly used L2-regularization, and the Taylor series expansion results in

$$\log p(\theta \mid D) \approx \log p(\hat{\theta} \mid D) - \frac{1}{2} (\theta - \hat{\theta})^{\top} (H + \tau I)(\theta - \hat{\theta}),$$

where the first-order term vanishes because the gradient of the log posterior is zero at the maximum $\hat{\theta}$, and $H$ is the Hessian of the negative log-likelihood $-\log p(D \mid \theta)$ evaluated at $\hat{\theta}$. Taking the exponential on both sides and approximating integrals by reverse engineering densities, the weight posterior is approximately a Gaussian with the mean $\hat{\theta}$ and the covariance matrix $(H + \tau I)^{-1}$. This means that the model uncertainty is represented by the Hessian, resulting in a Multivariate Normal distribution:

$$\theta \sim \mathcal{N}\!\left(\hat{\theta},\, (H + \tau I)^{-1}\right).$$
The core of the Laplace Approximation is the estimation of the Hessian.
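As an illustration, here is a minimal post-hoc sketch that estimates only the diagonal of the Hessian via summed squared per-example gradients (a crude Gauss-Newton/Fisher-style approximation) and then samples weights from the resulting Gaussian; the model, data, and prior precision are illustrative assumptions:

```python
# Post-hoc diagonal Laplace approximation around a trained MAP estimate.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)  # pretend this is the trained MAP model
x, y = torch.randn(256, 10), torch.randn(256, 1)
tau = 1.0  # prior precision, i.e. the L2-regularization strength

# Diagonal Hessian estimate H_ii ~ sum of squared per-example gradients.
h_diag = [torch.zeros_like(p) for p in model.parameters()]
for xi, yi in zip(x, y):
    model.zero_grad()
    F.mse_loss(model(xi), yi).backward()
    for h, p in zip(h_diag, model.parameters()):
        h += p.grad ** 2

# Posterior covariance (diagonal): Sigma = (H + tau * I)^(-1).
posterior_std = [1.0 / torch.sqrt(h + tau) for h in h_diag]
map_params = [p.detach().clone() for p in model.parameters()]

def sample_predict(x_new, n=30):
    # Monte Carlo prediction with weights drawn from N(theta_MAP, Sigma).
    preds = []
    with torch.no_grad():
        for _ in range(n):
            for p, mu, std in zip(model.parameters(), map_params, posterior_std):
                p.copy_(mu + std * torch.randn_like(std))
            preds.append(model(x_new))
        for p, mu in zip(model.parameters(), map_params):
            p.copy_(mu)  # restore the MAP weights
    return torch.stack(preds)

preds = sample_predict(torch.randn(4, 10))
print(preds.mean(0), preds.std(0))  # predictive mean and uncertainty
```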
3.2.4 Sum up Bayesian methods
Bayesian deep learning has gradually become a popular and powerful research area, and research on BNNs has mostly concentrated on how to infer the posterior. Beyond that, several new challenges have emerged:
(i) how to specify meaningful priors?
(ii) how to efficiently marginalize over the parameters for fast predictive uncertainty?
(iii) how to build infrastructure such as new benchmarks, evaluation protocols, and software tools?
(iv) how to better understand current methodologies and their potential applications?