NeurIPS 2020

1 intro & abstract

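One challenge in video representation learning is that videos are high-dimensional and dynamic, and pixel values follow a multimodal distribution
- Some recent work exploits the inductive biases of video and maps the high-dimensional data to a low-dimensional representation
- -> These methods decompose each video frame into semantically meaningful factors to obtain a disentangled representation of the video
- --> However, existing methods cannot model a video well when objects are missing from it
This paper aims to learn disentangled representations of videos with missing data
- It proposes DIVE (disentangled-imputed-video autoencoder)
  - It learns the video representation by decomposing the video into three latent variables: appearance, pose, and missingness
  - It imputes the missing data in the video using the learned disentangled latent variables
  - It uses the imputed video representation for stochastic, unsupervised video prediction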

2 related work

VAE-based

Learning to decompose and disentangle representations for video prediction. NIPS 2018
Sequential attend, infer, repeat: Generative modelling of moving objects. NIPS 2018
Compositional video prediction. NIPS 2018
Unsupervised learning of disentangled and interpretable representations from sequential data. NIPS 2017
Structured object-aware physics prediction for video modeling and planning. ICLR 2020
R-SQAIR: Relational sequential attend, infer, repeat. arXiv 2019

GAN-based

Unsupervised learning of disentangled representations from video. NIPS 2017
Decomposing motion and content for natural video sequence prediction. arXiv 2019
Stochastic video generation with a learned prior. ICML 2018

Addition & multiplication-based

Structured object-aware physics prediction for video modeling and planning. ICLR 2020

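For video data, the most common approach is to encode the video frames into latent variables, then decompose the latent representation into content and dynamics factors
- The content of a video (objects, background, ...) is fixed
- The pose of the objects in the video keeps changing
- --> But most models can only handle videos without missing data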

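Video prediction generally means predicting future frames from past frames
- Models such as LSTM, ConvLSTM, and PredRNN are used
- The problem with these models is that they predict deterministic values (frames), which does not capture the uncertainty of future frames well
The paper uses stochastic video prediction, which better captures the stochastic dynamics of the environment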

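The imputation model updates its hidden state according to the missingness label $\mathbf{z}_{i,m}^t$
If no data is missing, the hidden state of the imputation model is updated as:
- $\mathbf{h}_{i, y}^t=f_{enc}\left(\mathbf{h}_{i, y}^{t-1}, \mathbf{h}_{i, y}^{t+1},\left[\mathbf{y}^t, \mathbf{h}_{i-1, y}^t\right]\right)$
- Here $f_{enc}$ is a bidirectional LSTM
  - i-1 is the previous object (I could not find in the paper how the objects in a video are ordered; readers familiar with this area are welcome to fill this in)
If data is missing, $\mathbf{y}^t$ is unavailable and has to be imputed; denote the hidden state of the imputed content as
- $\hat{\mathbf{h}}_{i, y}^t=\operatorname{FC}\left(\mathbf{h}_{i, p}^{t-1}\right)$
- FC is a fully connected layer, and $\mathbf{h}_{i, p}^{t-1}$ is the hidden state of the pose, which is introduced in this subsection
Denote the hidden-layer vector as $\mathbf{u}_i^t$; it selects between the hidden states according to the missingness label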
- $\mathbf{u}_i^t=\left\{\begin{array}{ll} \hat{\mathbf{h}}_{i, y}^t & \mathbf{z}_{i, m}^t=1 \\ \gamma \mathbf{h}_{i, y}^t+(1-\gamma) \hat{\mathbf{h}}_{i, y}^t & \mathbf{z}_{i, m}^t=0 \end{array}, \quad \gamma \sim \operatorname{Bernoulli}(p)\right.$
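- When no data is missing, a mixture of $\mathbf{h}_{i,y}^t$ and $\hat{\mathbf{h}}_{i,y}^t$ is used; the paper finds that this works better
- The input is only y with missing values, so the missingness label $\mathbf{z}_{i,m}^t$ is not known directly; whether it is 0 or 1 is obtained from the missingness inference in 3.2.1 below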
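The selection rule for $\mathbf{u}_i^t$ can be sketched in a few lines of plain Python. This is only an illustration under assumptions: hidden states are plain lists of floats, and the function name `select_hidden_state` and the default `p=0.5` are hypothetical, not taken from the paper.

```python
import random

def select_hidden_state(h_obs, h_imp, z_missing, p=0.5, rng=random):
    """Return u_i^t from the observed encoding h_obs (h_{i,y}^t) and the
    imputed encoding h_imp (hat h_{i,y}^t), given the missingness label.

    Hypothetical helper: names and the default p are illustrative only.
    """
    if z_missing == 1:
        # Object is missing: only the imputed state is available.
        return list(h_imp)
    # Object is visible: gamma ~ Bernoulli(p) picks one of the two states.
    gamma = 1 if rng.random() < p else 0
    return [gamma * ho + (1 - gamma) * hi for ho, hi in zip(h_obs, h_imp)]

h_obs = [1.0, 2.0]   # stands in for h_{i,y}^t
h_imp = [0.5, 1.5]   # stands in for hat h_{i,y}^t
u_missing = select_hidden_state(h_obs, h_imp, z_missing=1)
u_visible = select_hidden_state(h_obs, h_imp, z_missing=0, p=0.5)
```

Because $\gamma$ is binary, the "mixture" in the visible case is a random choice between the two states for each call, not a weighted average.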

The hidden state of the pose is updated by an LSTM
- $\mathbf{h}_{i, p}^t=\operatorname{LSTM}\left(\mathbf{h}_{i, p}^{t-1}, \mathbf{u}_i^t\right)$

At the start we only have the video data y, so how do we obtain z?

The missingness variable is inferred as follows
- $\mathbf{z}_{i, m}^t=H(x), \quad x \sim \mathcal{N}\left(\mu_m, \sigma_m^2\right), \quad\left[\mu_m, \sigma_m^2\right]=\operatorname{FC}\left(\mathbf{h}_{i, y}^t\right), \quad H(x)= \begin{cases}1 & x \geq 0 \\ 0 & x<0\end{cases}$
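As a rough sketch of the sampling step above: draw $x$ from a Gaussian and threshold it with the Heaviside step $H$. Here `mu_m` and `sigma_m` are passed in directly (in the model they would come from the FC layer), `sigma_m` is assumed to be a standard deviation, and the function names are hypothetical.

```python
import random

def heaviside(x):
    """H(x) = 1 if x >= 0 else 0."""
    return 1 if x >= 0 else 0

def sample_missingness(mu_m, sigma_m, rng=random):
    """Sample z_{i,m}^t = H(x) with x ~ N(mu_m, sigma_m^2).

    Assumption: mu_m and sigma_m are plain floats standing in for the
    outputs of FC(h_{i,y}^t); sigma_m is a standard deviation.
    """
    x = rng.gauss(mu_m, sigma_m)
    return heaviside(x)

rng = random.Random(0)  # seeded generator for reproducibility
labels = [sample_missingness(0.0, 1.0, rng) for _ in range(1000)]
```

A strongly positive `mu_m` makes the label almost always 1 (missing), and a strongly negative one makes it almost always 0.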

$\beta_i^t \sim \mathcal{N}\left(\mu_p, \sigma_p^2\right),\left[\mu_p, \sigma_p^2\right]=\operatorname{FC}\left(\mathbf{h}_{i, p}^t\right)$

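The appearance variable is content that keeps changing over time
- The paper decomposes appearance into a static component $a_{i,s}$ and a dynamic component $a_{i,d}$
For the static component, the authors use the inverse affine spatial transformation from "Learning to decompose and disentangle representations for video prediction"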
- $\mathbf{a}_{i, s}=\operatorname{FC}\left(\mathbf{h}_{i, a}^K\right), \quad \mathbf{h}_{i, a}^{t+1}= \begin{cases}\operatorname{LSTM}_1\left(\mathbf{h}_{i, a}^t, \mathcal{T}^{-1}\left(\mathbf{y}^t ; \mathbf{z}_{i, p}^t\right)\right) & t<K \\ \operatorname{LSTM}_2\left(\mathbf{h}_{i, a}^t\right) & K \leq t<T\end{cases}$
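- (Prediction of future frames is therefore autoregressive: the hidden state at t is the input at t+1)
For the dynamic component, the authors model the differences between frames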
- $\mathbf{a}_{i, d}^1=\mathrm{FC}\left(\left[\mathbf{a}_{i, s}, \mathcal{T}^{-1}\left(\mathbf{y}^1 ; \mathbf{z}_{i, p}^1\right)\right]\right), \quad \mathbf{a}_{i, d}^{t+1}=\mathbf{a}_{i, d}^t+\delta_{i, d}^t, \quad \delta_{i, d}^t=\mathrm{FC}\left(\left[\mathbf{h}_{i, a}^t, \mathbf{a}_{i, s}\right]\right)$
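The recursion $\mathbf{a}_{i,d}^{t+1}=\mathbf{a}_{i,d}^t+\delta_{i,d}^t$ is just a running sum of per-step residuals. A minimal sketch, assuming the residuals are already given as plain lists (in the model they would be produced by the FC layer); the function name is hypothetical.

```python
def rollout_dynamic_appearance(a_d_first, deltas):
    """Unroll a_d^{t+1} = a_d^t + delta^t starting from a_d^1.

    Assumption: `deltas` holds precomputed residual vectors; in the model
    each one would come from FC([h_{i,a}^t, a_{i,s}]).
    """
    trajectory = [list(a_d_first)]
    for delta in deltas:
        prev = trajectory[-1]
        trajectory.append([a + d for a, d in zip(prev, delta)])
    return trajectory

# Two residual steps starting from a_d^1 = [0.0, 1.0]
traj = rollout_dynamic_appearance([0.0, 1.0], [[0.5, -0.5], [0.25, 0.25]])
# traj == [[0.0, 1.0], [0.5, 0.5], [0.75, 0.75]]
```

Modelling the change between frames means the dynamic component only has to encode small per-step updates.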
The final appearance is obtained by combining the dynamic and static components
- $q\left(\mathbf{z}_{i, a} \mid \mathbf{y}^{1: K}\right)=\prod_t \mathcal{N}\left(\mu_a, \sigma_a^2\right), \quad\left[\mu_a, \sigma_a^2\right]=\mathrm{FC}\left(\left[\mathbf{a}_{i, s}, \gamma \mathbf{a}_{i, d}^t\right]\right), \quad \gamma \sim \operatorname{Bernoulli}(p)$
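To make the gating in the formula above concrete, here is a small sketch of how the FC input $[\mathbf{a}_{i,s}, \gamma \mathbf{a}_{i,d}^t]$ could be assembled. The FC layer itself is left out, and the function name and list-based vectors are assumptions for illustration.

```python
import random

def gated_appearance_features(a_s, a_d_t, p=0.5, rng=random):
    """Build the FC input [a_s, gamma * a_d^t] with gamma ~ Bernoulli(p).

    Hypothetical helper: returns the concatenated feature list and the
    sampled gamma so the caller can see which branch was taken.
    """
    gamma = 1 if rng.random() < p else 0
    features = list(a_s) + [gamma * v for v in a_d_t]
    return features, gamma

a_s = [0.2, 0.4]     # static component
a_d_t = [1.0, -1.0]  # dynamic component at time t
features_on, _ = gated_appearance_features(a_s, a_d_t, p=1.0)   # gamma is always 1
features_off, _ = gated_appearance_features(a_s, a_d_t, p=0.0)  # gamma is always 0
```

With $\gamma=0$ the dynamic part is zeroed out, so the distribution parameters depend on the static component alone.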

Given a video with missing data $\left(\mathbf{y}^1, \cdots, \mathbf{y}^t\right)$, denote the underlying complete video as $\left(\mathbf{x}^1, \cdots, \mathbf{x}^t\right)$. The generative distribution of the video sequence is then: $p\left(\mathbf{y}^{1: K}, \mathbf{x}^{K+1: T} \mid \mathbf{z}^{1: T}\right)=\prod_{i=1}^N p\left(\mathbf{y}_i^{1: K} \mid \mathbf{z}_i^{1: K}\right) p\left(\mathbf{x}_i^{K+1: T} \mid \mathbf{z}_i^{K+1: T}\right)$
The term for each object is computed as follows: $p\left(\mathbf{y}_i^t \mid \mathbf{z}_i^t\right)=\mathcal{T}\left(f_{\operatorname{dec}}\left(\mathbf{z}_{i, a}^t\right) ; \mathbf{z}_{i, p}^t\right) \circ\left(1-\mathbf{z}_{i, m}^t\right), \quad p\left(\mathbf{x}_i^t \mid \mathbf{z}_i^t\right)=\mathcal{T}\left(f_{\operatorname{dec}}\left(\mathbf{z}_{i, a}^t\right) ; \mathbf{z}_{i, p}^t\right)$
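The per-object decoding can be illustrated with a toy example: place a decoded patch on a canvas according to the pose, then multiply by $(1-\mathbf{z}_{i,m}^t)$ to get the observed (possibly missing) version. This is only a sketch under assumptions: `translate` is a hypothetical stand-in for the spatial transformer $\mathcal{T}$ that handles integer shifts only, and the patch is treated as the already-decoded output of $f_{\operatorname{dec}}$.

```python
def translate(patch, canvas_size, dx, dy):
    """Toy stand-in for T: paste `patch` onto a blank square canvas at
    integer offset (dx, dy). The real T is an affine spatial transformer."""
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            rr, cc = r + dy, c + dx
            if 0 <= rr < canvas_size and 0 <= cc < canvas_size:
                canvas[rr][cc] = value
    return canvas

def render_object(patch, pose, z_missing, canvas_size=4):
    """Return (x_i^t, y_i^t): the complete rendering and its masked version."""
    dx, dy = pose
    complete = translate(patch, canvas_size, dx, dy)   # x_i^t
    observed = [[v * (1 - z_missing) for v in row] for row in complete]  # y_i^t
    return complete, observed

patch = [[1.0]]  # stands in for f_dec(z_{i,a}^t)
complete, observed_missing = render_object(patch, pose=(1, 2), z_missing=1)
_, observed_visible = render_object(patch, pose=(1, 2), z_missing=0)
```

When the label is 1 the observed rendering is all zeros while the complete rendering still contains the object.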