Multi-task Video Enhancement for Dentistry


Multi-task Video Enhancement for Dental Interventions

MICCAI 2022

Abstract

A microcamera firmly attached to a dental handpiece allows the dentist to continuously monitor the progress of conservative dental procedures. Video enhancement in video-assisted dental interventions alleviates low light, noise, blur, and camera shake, which together degrade visual comfort. To this end, the authors introduce a novel deep network for multi-task video enhancement that enables macro-visualization of the dental scene. In particular, the network jointly exploits video restoration and temporal alignment in a multi-scale manner to enhance the video effectively. Experiments on videos of natural teeth in phantom scenes show that the proposed network achieves state-of-the-art results on multiple tasks with near real-time processing. They release Vident-lab at https://doi.org/10.34808/1jby-ay90, the first dental video dataset with multi-task labels, to facilitate further research in related video processing applications.

Related Work

UberNet [9] and cross-stitch networks [16] are encoder-focused architectures that propagate task outputs across scales in the encoder.

PAD-Net [27] and PAP-Net [29] use multi-modal distillation; they are decoder-focused networks that fuse the outputs of task heads to make the final dense predictions, but only at a single scale.

MTI-Net [24], which is most similar to our architecture, extends the decoder fusion by propagating task-specific features bottom-up across multiple scales through the encoder.

Instead of propagating the task features in scale-specific distillation modules across scales to the encoder, our network simultaneously propagates task outputs to the encoder and to the task heads in the decoder. Furthermore, those networks make dense task predictions on static images, while we extend our network to videos.

Contribution

(i) a novel application of a microcamera in computer-aided dental intervention for continuous tooth macro-visualization during drilling (surprisingly, a hardware innovation; I walk away slightly deflated)

(ii)    a new, asymmetrically annotated dataset of natural teeth in phantom scenes with pairs of frames of compromised and good quality using a beam splitter,

(iii) a novel deep network for video processing that propagates task outputs to the encoder and decoder across multiple scales to model task interactions, and (iv) demonstration that an instantiated model effectively addresses multi-task video enhancement in our application by matching and surpassing state-of-the-art results of single-task networks in near real-time.

Method

The key idea: exploit the interactions between different tasks to improve video enhancement.

Video enhancement tasks are interrelated. For example:
-- aligning video frames helps deblurring;
-- denoising and deblurring reveal image features that help motion estimation.
This interdependence can be exploited by designing a multi-task model.

MOST-Net is a multi-output, multi-scale, multi-task network architecture. Its goal is to model interactions between tasks via multi-scale features in both the encoder and the decoder. The network produces outputs for several tasks (indexed over T) at several scales (indexed by s).

Propagation:

  • Intra-scale propagation: task outputs are propagated within the current scale.
  • Cross-scale propagation: task outputs from lower scales are upsampled and then propagated to the decoder layers and task branches of the higher scales (see the sketch below).
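To make the cross-scale idea concrete, here is a minimal PyTorch sketch (names and shapes are illustrative, not the paper's code): the task output of scale s+1 is upsampled to scale s and concatenated with the decoder features there.

```python
import torch
import torch.nn.functional as F

def propagate_task_output(decoder_feat_s: torch.Tensor,
                          task_output_s1: torch.Tensor) -> torch.Tensor:
    """Upsample a lower-scale task output (e.g. a segmentation mask at scale
    s+1) to the resolution of scale s and concatenate it with the decoder
    features of scale s, so that the higher scale can refine it."""
    up = F.interpolate(task_output_s1, size=decoder_feat_s.shape[-2:],
                       mode="bilinear", align_corners=False)
    return torch.cat([decoder_feat_s, up], dim=1)

# Example: a 1-channel mask at half resolution is fed into the next scale up.
feat_s = torch.randn(1, 32, 64, 64)      # decoder features at scale s
mask_s1 = torch.rand(1, 1, 32, 32)       # task output at scale s+1
fused = propagate_task_output(feat_s, mask_s1)   # shape (1, 33, 64, 64)
```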

Constraints:

u_i denotes some operator, for instance the upsampling operator for segmentation or the scaling operator for homography estimation, which ties the outputs of task i at neighbouring scales together.
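Reading between the lines, the constraint relates each task output across neighbouring scales; a hedged reconstruction of the missing formula (the paper's exact notation may differ):

$$
O_i^{\,s} = u_i\!\left(O_i^{\,s+1}\right), \qquad i = 1,\dots,T, \quad s = 1,\dots,S-1,
$$

where, as stated above, u_i is e.g. bilinear upsampling for the segmentation output and a scaling operator for the homography output.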

Problem Statement

The model has to solve video restoration, teeth segmentation, and motion estimation simultaneously, and it is trained under an assumed degradation model for the input frames.

T = 3 with O1: video restoration, O2: segmentation, O3: homography estimation.

The video stream generates observations B_{t−P}, …, B_t, where t is the time index and P > 0 is a scalar referring to the number of past frames.

The problem is to (1) estimate a clean frame, (2) a binary teeth segmentation mask, and (3) approximate the inter-frame motion by a homography matrix; at scale s = 1 the joint output of the three tasks is denoted by the triplet (Rt, Mt, Ht).



Let x correspond to a pixel location. Given per-pixel blur kernels k_{x,t} of size K, the degraded image at s = 1 (modelling the degradation of the input video, i.e. blur and noise) is generated as follows.
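The formula itself did not survive in my notes; from the description (a per-pixel blur kernel applied to the clean frame, plus noise), it should be of roughly this form, where n_t is my own notation for the noise term:

$$
B_t(x) \;=\; \left(k_{x,t} \ast R_t\right)(x) \;+\; n_t(x),
$$

i.e. each degraded pixel is the clean frame locally blurred by its K × K kernel and corrupted by additive noise.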

We assume multiple independently moving objects are present in the considered scenes, while our task is to estimate only the motion related to the object of interest (i.e. the teeth), which is present in the region indicated by the non-zero values of the mask M:

∀t ∀x means "for all t and all x".

Training

Here the paper defines the loss function and the optimization objective of the multi-task, multi-scale model.

Dataset

Loss Function

 

The total loss sums over N (number of samples), T (number of tasks), and S (number of scales), i.e. N × T × S terms in total.

Types of loss functions

The model learns the parameters Θ by minimizing the total loss, so that the predictions of all tasks at all scales are optimized jointly. The optimization therefore has to account for the interactions between tasks and the synergy across scales, which is the core idea of multi-task, multi-scale learning (a minimal sketch follows).
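A minimal sketch of such an objective, assuming a weighted sum of per-task losses over all scales. The weights λ_i reuse the values reported later in Setup (2×10^-4, 5×10^-5, 1); the particular loss functions are illustrative placeholders, not the paper's choices.

```python
import torch

def total_loss(outputs, targets, task_losses, lambdas):
    """outputs[i][s], targets[i][s]: prediction and ground truth of task i at scale s.
    task_losses[i]: loss function of task i (placeholders below).
    lambdas[i]: per-task weight."""
    loss = 0.0
    for i, (outs, gts) in enumerate(zip(outputs, targets)):   # over tasks
        for out_s, gt_s in zip(outs, gts):                    # over scales
            loss = loss + lambdas[i] * task_losses[i](out_s, gt_s)
    return loss

# Example with T = 3 tasks (restoration, segmentation, homography offsets);
# the batch dimension (N samples) is handled inside each loss.
task_losses = [torch.nn.L1Loss(), torch.nn.BCEWithLogitsLoss(), torch.nn.L1Loss()]
lambdas = [2e-4, 5e-5, 1.0]
```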

I feel I still have not fully understood the multi-task learning part; I will look at a few other papers.

Structure

MOST-Net enables refinement of lower scale segmentations by upsampling and inputting them at the task-specific branches of higher scales.

Encoders

MOST-Net extracts features from the two input frames B_{t−1} and B_t independently at three scales. In other words, the model processes the input at several scales simultaneously.

U-shaped downsampling: features are extracted via 3 × 3 convolutions with strides of 1, 2, 2 for s = 1, 2, 3, followed by ReLU activations and 5 residual blocks [4] at each scale. The residual connections are augmented with an additional branch of convolutions in the Fast Fourier domain.
The output channel dimension at scale s is 2^(s+4), i.e. 32, 64, 128.
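A sketch of one encoder scale as I read this paragraph (the Fast Fourier branch of the residual blocks is omitted here; the widths 2^(s+4) and the strides follow the text, everything else is an assumption):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block; the paper additionally adds a Fast Fourier branch."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def encoder_scale(s: int, in_ch: int) -> nn.Sequential:
    """One encoder scale: a 3x3 conv (stride 1 for s=1, stride 2 otherwise),
    ReLU, then 5 residual blocks; the output width is 2**(s + 4)."""
    out_ch = 2 ** (s + 4)                     # 32, 64, 128 for s = 1, 2, 3
    stride = 1 if s == 1 else 2
    blocks = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
              nn.ReLU(inplace=True)]
    blocks += [ResBlock(out_ch) for _ in range(5)]
    return nn.Sequential(*blocks)

# Scales are chained: s=1 takes the RGB frame, s=2 and s=3 take the previous scale.
enc1, enc2, enc3 = encoder_scale(1, 3), encoder_scale(2, 32), encoder_scale(3, 64)
```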

At each scale, the encoder features of the current and the previous frame are concatenated, and a channel attention mechanism [30] follows to fuse them into a single feature map.

MOST-Net uses the homography outputs from lower scales to warp the encoder features from the previous time step before this fusion.
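A generic way to implement such a warp with grid_sample; whether MOST-Net warps forward or backward, and at exactly which point, is not something I can recover from my notes, so treat this only as an illustration:

```python
import torch
import torch.nn.functional as F

def warp_with_homography(feat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (B, C, Hh, Ww) with a batch of 3x3 homographies H.
    H is assumed to map pixel coordinates of the output frame to pixel
    coordinates of the input frame (backward warping)."""
    B, C, Hh, Ww = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(Hh, dtype=feat.dtype, device=feat.device),
        torch.arange(Ww, dtype=feat.dtype, device=feat.device),
        indexing="ij")
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (Hh*Ww, 3)
    src = torch.einsum("bij,nj->bni", H, grid)                  # apply H per pixel
    src = src[..., :2] / (src[..., 2:] + 1e-8)                  # dehomogenize
    # Normalize coordinates to [-1, 1] for grid_sample.
    src_x = 2.0 * src[..., 0] / (Ww - 1) - 1.0
    src_y = 2.0 * src[..., 1] / (Hh - 1) - 1.0
    sample_grid = torch.stack([src_x, src_y], dim=-1).reshape(B, Hh, Ww, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)
```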

Decoders

The encoder features are passed to the expanding (decoder) blocks scale-wise via skip connections.

At the lowest scale (s = 3), the features are passed directly to a stack of two residual blocks with 128 output channels. Transposed convolutions with stride 2 are then used twice to recover the resolution.

At higher scales (s < 3), the encoder features are first concatenated with the upsampled decoder features and convolved with 3 × 3 kernels to halve the number of channels (why halve the channels?). Subsequently, they are propagated through two residual blocks with 64 and 32 output channels, respectively. The residual block outputs constitute scale-specific shared backbones. Lightweight task-specific branches follow to estimate the dense outputs. Specifically, at each scale one 3 × 3 convolution estimates the segmentation mask and two 3 × 3 convolutions, separated by ReLU, yield the restored frame.
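My sketch of these scale-specific task branches (the input width and the intermediate width of the restoration head are assumptions; the heads return segmentation logits and an RGB frame):

```python
import torch
import torch.nn as nn

class TaskHeads(nn.Module):
    """Lightweight task branches on top of the shared backbone of one scale."""

    def __init__(self, in_ch: int = 32):
        super().__init__()
        # One 3x3 convolution -> 1-channel segmentation logits.
        self.seg_head = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        # Two 3x3 convolutions separated by ReLU -> 3-channel restored frame.
        self.restore_head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 3, kernel_size=3, padding=1))

    def forward(self, shared_feat: torch.Tensor):
        mask_logits = self.seg_head(shared_feat)
        restored = self.restore_head(shared_feat)
        return restored, mask_logits
```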

At each scale, the homography estimation modules estimate 4 corner offsets, related one-to-one to homographies via the Direct Linear Transformation (DLT) as in [5,12]. The motion gated attention modules multiply the decoder features with the segmentation masks to filter out context irrelevant to the motion of the teeth. The channel dimensionality is then halved by a 3 × 3 convolution, while a second convolution extracts features from the restored output. The concatenation of the two streams forms the features passed to the homography estimation module.
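How I picture the motion gated attention module as code (channel widths are assumptions; the gating by the mask and the two-stream concatenation follow the text):

```python
import torch
import torch.nn as nn

class MotionGatedAttention(nn.Module):
    """Gate the decoder features with the predicted teeth mask, compress them,
    extract features from the restored frame, and concatenate both streams."""

    def __init__(self, feat_ch: int = 32, out_ch: int = 16):
        super().__init__()
        self.compress = nn.Conv2d(feat_ch, feat_ch // 2, 3, padding=1)  # halve channels
        self.from_restored = nn.Conv2d(3, out_ch, 3, padding=1)

    def forward(self, feat, mask_prob, restored):
        gated = feat * mask_prob                 # filter out non-teeth context
        a = self.compress(gated)
        b = self.from_restored(restored)
        return torch.cat([a, b], dim=1)          # features for homography estimation
```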

Homography Estimation Module: at each scale, the gated features and the restored-frame features are employed to predict the offsets with shallow downstream networks. Predicted offsets at lower scales are transformed back to homographies and cascaded bottom-up [12] to refine the higher-scale ones.

Similarly to [5], we use blocks of 3 × 3 convolutions coupled with ReLU, batch normalization and max-pooling to reduce the spatial size of the features. Before the regression layer, a dropout of 0.2 is applied. For s = 1, the convolution output channels are 64, 128, 256, 256 and 256. For s = 2, 3 the network depth is cropped from the second and third layers onwards, respectively.
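A hedged sketch of the s = 1 offset head following this description (the input width, the global pooling before the regression layer, and the 8-value output for 4 corner offsets are my assumptions):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3 conv + ReLU + BatchNorm + max-pooling, as in the description above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2))

class OffsetRegressor(nn.Module):
    """Offset head for s = 1: channel widths follow the text (64, 128, 256, 256, 256)."""

    def __init__(self, in_ch: int = 32):
        super().__init__()
        widths = [64, 128, 256, 256, 256]
        layers, prev = [], in_ch
        for w in widths:
            layers.append(conv_block(prev, w))
            prev = w
        self.features = nn.Sequential(*layers)
        self.dropout = nn.Dropout(0.2)
        self.regress = nn.Linear(prev, 8)   # 4 corner offsets = 8 scalar values

    def forward(self, x):
        f = self.features(x)
        f = f.mean(dim=(2, 3))              # global average pooling (assumption)
        return self.regress(self.dropout(f))
```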

Task-Specific Branches

(I added this part myself with the help of GPT to make it easier to understand, since I had not worked on multi-task learning before.)

Each task (colorization/restoration, motion estimation, segmentation) is handled by a separate branch of the network. In the figure, these branches are the paths where F1, F2, F3 (the features at different scales) pass through different processing stages (e.g. motion gated attention, channel attention, homography estimation) to produce the task-specific outputs: the colorized frame Rt, the mask Mt, and the homography Ht.

The network is optimized for multiple tasks by sharing features across the task-specific branches, while each branch focuses on the output of one particular task (colorization, segmentation, motion estimation). The losses of the individual tasks are computed separately and combined in the final objective, which lets the model learn several tasks simultaneously while sharing common feature representations.

Experiment

Dataset

Vident-lab: a dataset for multi-task video processing of phantom dental scenes - Open Research Data - Bridge of Knowledge

  • Frame-to-Frame (F2F) Training:

    • The model is trained on static video fragments recorded with camera C1. The trained image denoiser is then applied to the noisy frames to obtain denoised frames and their noise maps.
  • Denoising Process:

    • The noisy frames are first denoised with the trained model. The denoised frames are then temporally interpolated and averaged over a window of 17 frames to produce a blurry effect, simulating realistic motion blur.
  • Adding Noise:

    • After the blur synthesis, the noise maps are added to the blurry frames to form the input video frames B (the denoised frames are temporally interpolated [19] 8 times and averaged over a temporal window of 17 frames to synthesize realistic blur). The noise maps represent the noise that was present in the original noisy frames. A sketch of this synthesis is given after this list.
  • Colorization: registration of frames between the two different modalities C1 and C2:

    • To generate the output video frames R, the frames from camera C1 are colorized using frames from a second camera C2, which serve to create the ground-truth frames.
    • Specifically, the frames from C1 are colorized based on data from C2 to form the colorized video frames. This sidesteps the difficulty of aligning frames between the two cameras while still producing accurate pixel-to-pixel correspondences.
  • Color Mapping Network:

    • A color mapping (CM) network is learned to predict the parameters of 3D functions that map the dental scene colors of camera C2 to camera C1. This yields a precise color mapping and ensures exact spatial correspondence between the frames B and R.
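The blur-and-noise synthesis from the list above, condensed into one illustrative function (the frame interpolation step [19] is assumed to have been applied already; array layouts and the value range are assumptions):

```python
import numpy as np

def synthesize_degraded_frame(denoised_clip: np.ndarray, noise_map: np.ndarray) -> np.ndarray:
    """Illustrative sketch of the blur + noise synthesis.

    `denoised_clip` is assumed to be a temporally interpolated stack of
    17 denoised frames (T, H, W, 3) around the target time step, and
    `noise_map` the noise extracted from the original noisy frame.
    """
    blurry = denoised_clip.astype(np.float32).mean(axis=0)   # average over the temporal window
    degraded = blurry + noise_map.astype(np.float32)         # re-inject the original noise
    return np.clip(degraded, 0.0, 255.0).astype(np.uint8)
```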

Segmentation masks and homographies

HRNet48 [22], pretrained on ImageNet, is fine-tuned on our annotations to automatically segment the teeth in the remaining frames of all three sets. We compute optical flows between consecutive clean frames with RAFT [23]. Motion fields are cropped with the teeth masks Mt to discard other moving objects, such as the dental bur or the suction tube, since we are interested in stabilizing the videos with respect to the teeth. Subsequently, a partial affine homography H is fitted to the segmented motion field by RANSAC (a minimal sketch of this step follows).
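A minimal sketch of that fitting step, assuming a dense RAFT flow and a binary teeth mask as inputs; cv2.estimateAffinePartial2D with RANSAC stands in for the paper's partial affine homography fit:

```python
import cv2
import numpy as np

def fit_teeth_homography(flow: np.ndarray, teeth_mask: np.ndarray, max_pts: int = 2000) -> np.ndarray:
    """Fit a partial affine transform to a teeth-masked flow field.

    `flow` is a dense optical flow (H, W, 2) and `teeth_mask` a binary mask (H, W).
    Sampling and thresholds are illustrative choices.
    """
    ys, xs = np.nonzero(teeth_mask)
    if len(xs) > max_pts:                       # subsample correspondences for speed
        idx = np.random.choice(len(xs), max_pts, replace=False)
        ys, xs = ys[idx], xs[idx]
    src = np.stack([xs, ys], axis=1).astype(np.float32)
    dst = src + flow[ys, xs]                    # displaced positions in the next frame
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                       ransacReprojThreshold=3.0)
    H = np.eye(3, dtype=np.float32)             # embed the 2x3 affine into a 3x3 homography
    H[:2, :] = A
    return H
```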

Setup

We train, validate, and test all methods on our dataset (Tab. 1). In all MOST-Net training runs, we set λ1, λ2, λ3 to 2 × 10^−4, 5 × 10^−5, and 1 to balance the tasks in Eq. 4.

Data are augmented by horizontal and vertical flips with probability 0.5, random channel perturbations, and color jittering, following [31].

Batch size 16, Adam optimizer, learning rate 1e−4, decayed to 1e−6 with cosine annealing.
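The reported optimization setup in a few lines of PyTorch (the epoch count and the stand-in model are placeholders, not values from the paper):

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # stand-in for MOST-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)              # decay 1e-4 -> 1e-6

for epoch in range(100):
    # ... iterate over the dataloader with batch size 16, compute the
    # weighted multi-task loss, backprop, optimizer.step() ...
    scheduler.step()
```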

PyTorch 1.10 (FP32). The inference speed is reported in frames-per-second (FPS) on GPU NVidia RTX 5000.

Results
