1. First author: Keyang Luo
2. Year: 2020
3. Venue: CVPR
4. Keywords: MVS, cost volume, attention mechanism, regularization
5. Motivation:
- However, the feature matching results from different channels are usually not of the same importance since the captured scenes could be significantly different across channels.
- Another major problem faced by learning-based MVS methods is how to effectively aggregate and regularize the matching confidence volume into a latent depth probability volume (LPV), from which the depth/disparity map then can be inferred via some regression or multi-class classification techniques.
- The multi-view ground-truth depth maps introduced in MVSNet for training MVS networks have been widely used, but they still contain quite many wrongly labeled pixels, which could potentially cause some undesired effects on training and validation.
6. Goal: leverage the strengths of attention mechanisms to address the problems above.
7. Key ideas:
- We design an attention-enhanced matching confidence volume, which takes account of both perceptual information and contextual information of the local scene to improve the matching robustness.
- We propose a novel attention-guided regularization module for hierarchically aggregating and regularizing the matching confidence volume in the top-down/bottom-up manner.
- We develop a simple but effective filtering strategy to improve the quality of multi-view ground-truth depth maps for network training.
8. Experimental results:
Our method achieves the best overall performance on the DTU benchmark and the intermediate sequences of Tanks & Temples benchmark over many state-of-the-art MVS approaches.
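1. AttMVS architecture
- First, an encoder network extracts perceptual features from the input images;
- The features are used to build an attention-enhanced matching confidence volume for robust and accurate matching;
- An attention-guided hierarchical regularization module, built from ray fusion modules (RFMs), hierarchically aggregates and regularizes the matching confidence volume;
- 3D convolutions estimate the depth map from the regularized confidence volume;
- In the architecture diagram, "⊙" denotes channel-wise multiplication, "⊚" denotes homography warping followed by raw pixel-wise confidence matching, and R'_i and R_i are the non-regularized and regularized matching confidence volumes at level i.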
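2. Feature extractor
A modified version of the P-MVSNet feature extractor extracts perceptual features from the reference image I0 and the N source images, each of size H × W; these features are used to learn multi-view photometric consistency. Accurate and robust feature representations for pixel-wise matching require a network with sufficient capacity, so the number of channels and the number of layers (10) are increased. The output has shape [H/4, W/4, 16], and the network uses Instance Normalization and LeakyReLU.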
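3. Attention-enhanced matching confidence
Global average pooling squeezes each feature volume into a per-channel descriptor v_i (one value per individual channel; this is my own reading).
Here j = 0, 1, ..., Z-1, where Z is the total number of sampled depth hypothesis planes, ⊙ denotes channel-wise multiplication, and M_j is the raw pixel-wise confidence map generated from the homography-warped feature maps. The examples of learned weights in the figure show that (i) different scenes have different weights on some channels and similar weights on others, and (ii) within each scene, different channels have different weights.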
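The squeeze-and-reweight step described above can be sketched in NumPy. This is a minimal illustration rather than the paper's implementation: the notes only state the global average pooling and the channel-wise multiplication, so the two-layer "excitation" mapping, its random parameters `w1`, `b1`, `w2`, `b2`, the function names, and the (Z, H, W, C) layout are all assumptions made for the example.

```python
import numpy as np

def squeeze_channels(volume):
    """Global average pooling: (Z, H, W, C) volume -> (C,) descriptor v."""
    return volume.mean(axis=(0, 1, 2))

def excite(v, w1, b1, w2, b2):
    """Hypothetical two-layer mapping (ReLU, then sigmoid) giving weights in (0, 1)."""
    hidden = np.maximum(0.0, v @ w1 + b1)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))

def attention_enhance(raw_volume, weights):
    """Channel-wise multiplication: scale channel c of every M_j by weights[c]."""
    return raw_volume * weights  # broadcasts over the trailing channel axis

# Toy sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
Z, H, W, C = 8, 6, 5, 16
raw = rng.random((Z, H, W, C))            # stack of raw confidence maps M_0..M_{Z-1}
w1, b1 = rng.normal(size=(C, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, C)), np.zeros(C)

weights = excite(squeeze_channels(raw), w1, b1, w2, b2)
enhanced = attention_enhance(raw, weights)
print(enhanced.shape, weights.round(2))
```

Because the weights depend on the pooled statistics of the input, two different scenes generally produce different weight vectors, which matches the observation about per-scene and per-channel weights.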
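4. Attention-guided hierarchical regularization
Both context understanding modules consist of three 3D convolution blocks. In the pre-context module, the second block downsamples the matching confidence volume and increases the number of channels; in the post-context module, the second block performs the reverse operation. The RFM at level l-1 (l = 1, 2, 3) can be formulated as:
where R^e_{l-1} is the output of the pre-context understanding module fed with R'_{l-1}, R_l is the regularized matching confidence volume at level l, ⊕ denotes element-wise addition, and the ray weighting map w_r* is obtained from w_r = |R^e_{l-1} - R_l| in the same way as in Eq. (2). The post-context understanding module then processes R*_{l-1} further to produce the regularized matching confidence volume R_{l-1}.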
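5. Depth regression and loss function
where λ > 0 is a weighting coefficient. The relative depth loss is defined as
where N_d is the total number of pixels (i, j) with valid ground-truth depth, δ = (Dmax - Dmin)/(Z - 1) is the sampling interval between adjacent depth hypothesis planes, and d* is the ground-truth depth. To keep the depth gradients of the predicted depth map consistent with those of the ground-truth depth map, the inter-gradient regularization loss is defined as: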
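The definitions in this section can be illustrated with a minimal NumPy sketch. The formulas are not reproduced in these notes, so the code is an assumption-laden reading and not the paper's exact losses: the relative depth loss is taken to be the mean absolute depth error over valid pixels divided by δ, the gradient term is taken to be the mean absolute difference of forward-difference gradients divided by δ, and the two are combined as relative + λ · gradient. All function names and the numeric values are hypothetical.

```python
import numpy as np

def depth_interval(d_min, d_max, num_planes):
    """Sampling interval delta = (Dmax - Dmin) / (Z - 1)."""
    return (d_max - d_min) / (num_planes - 1)

def relative_depth_loss(pred, gt, valid, delta):
    """Assumed form: mean |d - d*| over the N_d valid pixels, in units of delta."""
    n_d = valid.sum()
    return np.abs(pred - gt)[valid].sum() / (n_d * delta)

def gradient_loss(pred, gt, valid, delta):
    """Assumed form: mean |grad(d) - grad(d*)| using forward differences in y and x."""
    total, count = 0.0, 0
    for axis in (0, 1):
        err = np.abs(np.diff(pred, axis=axis) - np.diff(gt, axis=axis))
        if axis == 0:
            both = valid[1:, :] & valid[:-1, :]   # both neighbours must be valid
        else:
            both = valid[:, 1:] & valid[:, :-1]
        total += err[both].sum()
        count += int(both.sum())
    return total / (max(count, 1) * delta)

def total_loss(pred, gt, valid, delta, lam):
    """Assumed combination: relative term plus lam times the gradient term."""
    return relative_depth_loss(pred, gt, valid, delta) + lam * gradient_loss(pred, gt, valid, delta)

# Illustrative numbers only: 256 planes over a 425..935 depth range.
delta = depth_interval(425.0, 935.0, 256)
gt = 600.0 + np.arange(20, dtype=float).reshape(4, 5)
valid = np.ones((4, 5), dtype=bool)
valid[0, 0] = False                               # one pixel without ground truth
pred = gt + delta                                 # off by exactly one depth interval
print(delta, total_loss(pred, gt, valid, delta, lam=0.5))
```

Dividing by δ expresses the error in units of depth hypothesis planes, so the same loss value means the same thing for scenes with different depth ranges.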
6. Point cloud reconstruction
7. Experiments
7.1. Implementation
The homography transformations are computed in advance. We also notice that all scanned scenes share the same set of camera parameters and the adjacent relationships between the cameras are also fixed. Therefore, we pre-calculate all possible homography transformations in advance and directly use them during training of the network, which reduces the training time of each minibatch from around 1.8s to 1.2s (saves about one-third of the training time).
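The pre-computation idea can be sketched as a lookup table keyed by (reference, source) camera pair. This is a hedged illustration, not the authors' code: it assumes world-to-camera extrinsics of the form X_cam = R · X_world + t and fronto-parallel depth planes in the reference camera frame, and the names `plane_homography`, `precompute_homographies`, and the `cameras` dictionary layout are hypothetical.

```python
import numpy as np

def plane_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, depth):
    """Homography mapping reference pixels to source pixels for the plane z = depth
    in the reference camera frame (assumes X_cam = R @ X_world + t)."""
    R_rel = R_src @ R_ref.T                 # rotation from reference to source frame
    t_rel = t_src - R_rel @ t_ref           # translation from reference to source frame
    n = np.array([0.0, 0.0, 1.0])           # normal of a fronto-parallel plane
    return K_src @ (R_rel + np.outer(t_rel, n) / depth) @ np.linalg.inv(K_ref)

def precompute_homographies(cameras, pairs, depths):
    """Build a cache {(ref_id, src_id): array of shape (Z, 3, 3)} once, before training."""
    cache = {}
    for ref_id, src_id in pairs:
        K_r, R_r, t_r = cameras[ref_id]
        K_s, R_s, t_s = cameras[src_id]
        cache[(ref_id, src_id)] = np.stack(
            [plane_homography(K_r, R_r, t_r, K_s, R_s, t_s, d) for d in depths]
        )
    return cache

# Toy setup: two cameras with shared intrinsics and made-up poses.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = {
    0: (K, np.eye(3), np.zeros(3)),
    1: (K, np.eye(3), np.array([-0.2, 0.0, 0.0])),
}
cache = precompute_homographies(cameras, pairs=[(0, 1)], depths=[1.5, 2.0, 2.5])
print(cache[(0, 1)].shape)
```

Since the camera parameters and view pairings are fixed across scans, the cache depends only on the pair and the depth plane, so each training step can index into it without recomputing any matrix products.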
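7.2. Benchmark results and generalization
The Tanks & Temples images overlap heavily, so a stricter threshold is used during point cloud fusion to suppress outliers. Results on the advanced sequences are worse than on the intermediate sequences because the depth ranges are very large, which is a common limitation of learning-based methods.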
The performance of our method on the advanced sequences is still competitive but is worse than that on the intermediate sequences when compared with some conventional MVS methods. We think the main reason is that for great majority part of images in the advanced sequences, the interested depth ranges are very large, but due to the GPU memory limitation, our method could not sample sufficient hypothesized depth planes to assure the quality of predicted depth maps even though the depth map refinement has been used. Thus, our method suits better to reconstruct the scenes with the interested depth range of the captured images being concentrated, which is also the common restriction of current learning-based MVS algorithms.