Close reading: Generating Mammography Reports from Multi-view Mammograms with BERT

news2025/2/22 5:48:16

Close reading (highly recommended): Generating Mammography Reports from Multi-view Mammograms with BERT (Part 1)

One of the authors here is named Ilya, which gave me quite a scare.

1. Abstract

Writing mammography reports can be error-prone and time-consuming for radiologists. In this paper we propose a method to generate mammography reports given four images, corresponding to the four views used in screening mammography. To the best of our knowledge our work represents the first attempt to generate the mammography report using deep-learning. We propose an encoder-decoder model that includes an EfficientNet-based encoder and a Transformer-based decoder. We demonstrate that the Transformer-based attention mechanism can combine visual and semantic information to localize salient regions on the input mammograms and generate a visually interpretable report. The conducted experiments, including an evaluation by a certified radiologist, show the effectiveness of the proposed method. Our code is available at https://github.com/sberbank-ai-lab/mammo2text.


2. Introduction

Breast cancer represents a global healthcare problem (Glo, 2016). Increasing numbers of new cases and deaths are observed in both developed and less developed countries, only partially attributable to the increasing population age. Serial screening with mammography is the most effective method to detect early stage disease and decrease mortality. The goal of screening is to detect breast cancers when still curable to decrease breast cancer-specific mortality (Duffy et al., 2020).
The aim is to reduce mortality by detecting the disease while it is still curable.

The European Society of Breast Imaging (EUSOBI) together with 30 national breast radiology bodies recommend that only qualified radiologists should be involved in screening programs (Sardanelli et al., 2017). As the amount of organized breast screening programs grows across the world, the burden on radiologists increases with it. In national screening programs such as in Holland or Sweden, radiologists may need to read 100 radiology images per hour (Abbey et al., 2020). With a growing number of screening programs, we need more trained radiologists and new technologies that can make their workflow more effective. Since one of the most time consuming procedures in radiology is writing medical-imaging reports, we explore the potential for deep-learning to automatically generate diagnostic reports of screening mammograms.
This sets out the background: the workload on radiologists is what motivates automatic report generation.
The rapid evolution of deep learning and artificial intelligence technologies enables them to be used as a strong tool for providing clinical decision-making support to the medical community. While many problems in the area of medical imaging and text analysis have been addressed effectively, there is no known approach to generating clinical reports for mammography studies. There are various reasons for this, such as the requirements regarding the accuracy, completeness and diagnostic relevance of the clinical information contained in the report. In this article, we present a framework (Figure 1) that takes mammograms as an input, automatically generates mammography reports, and visualizes the attention of the model to provide the interpretability of the process.
We use an encoder-decoder architecture, where the encoder extracts visual features and the decoder generates reports. We adopt a convolutional neural network, specifically EfficientNet (M Tan, 2019), to extract visual features of the four images, corresponding to the four views used in screening mammography.

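Reference:
Tan, M., & Le, Q. (2019, May). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105-6114). PMLR.
EfficientNet (M Tan, 2019) is really a paradigm for constructing vision networks (it scales depth, width, and input resolution together). 😐 Why use this particular vision model? How should one build a better hybrid vision model, and how should its parameters be combined? I wonder whether it was chosen because medical images differ from natural images, especially mammograms with their fine calcification points, and whether that is the reason for "rethinking" how the vision model is structured.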

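😐 In fact, the authors explain the choice: for high-resolution images such as mammograms, EfficientNet B0 is relatively lightweight and therefore more efficient.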
We use a deep multi-view (N Wu, 2019) CNN based on EfficientNet B0 (M Tan, 2019). We chose EfficientNet B0 because it is relatively lightweight and fits in GPU memory when using high resolution images. We have one EfficientNet instance for all views (R-CC, L-CC, R-MLO, L-MLO), i.e. model weights are shared. The first convolutional layer is replaced to accept a one-channel image. The last fully-connected layer of EfficientNet is discarded. Outputs from all four views are averaged by channels and one fully connected layer is added.


For language modeling, we utilize BERT (Devlin et al., 2018), inserting an additional attention sub-layer to perform multi-head attention over the regional feature embeddings produced by the encoder.

We modify the Transformer-based attention mechanism (Vaswani et al., 2017) such that it attends to the visual information on four mammography views and previously generated words. We use the attention scores to build visually interpretable image-text attention mappings.
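To make the idea of attending over regions of the four views more concrete, here is a minimal sketch of scaled dot-product attention for a single query (one word position) over region features. The feature values, the two-regions-per-view layout, and the names `attend`, `region_features`, and `word_query` are all made up for illustration; the paper's actual decoder uses multi-head attention inside BERT, which this toy version does not reproduce.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# Hypothetical 2-d region features: two regions per view, four views.
VIEWS = ["R-CC", "L-CC", "R-MLO", "L-MLO"]
region_features = {
    "R-CC": [[0.1, 0.0], [0.2, 0.1]],
    "L-CC": [[0.0, 0.1], [2.0, 1.5]],
    "R-MLO": [[0.1, 0.1], [0.0, 0.2]],
    "L-MLO": [[0.3, 0.0], [0.1, 0.1]],
}

# Flatten all regions of all views into one sequence the query can attend to.
keys = [feat for view in VIEWS for feat in region_features[view]]
labels = [(view, idx) for view in VIEWS for idx in range(len(region_features[view]))]

word_query = [1.0, 1.0]  # hypothetical embedding of the word being generated
context, weights = attend(word_query, keys, keys)

# The attention weights form a distribution over regions; the largest one
# marks the region a visualisation would highlight for this word.
top_region = labels[max(range(len(weights)), key=weights.__getitem__)]
print(top_region, [round(w, 3) for w in weights])
```

Reshaping `weights` back onto the spatial grid of each view is what would give an image-text attention map in a real model.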

In addition to that, we conduct a series of in-depth quantitative and qualitative experiments with the help of an experienced radiologist to demonstrate the clinical validity of our approach. We compare the predictions of our models with the ground truth to understand where the models make mistakes and demonstrate that our best model successfully describes different parts of the breast, and detects pathological regions and abnormalities. We evaluate the image-text attention mappings to demonstrate the interpretability of our model. As far as we are aware, our work represents the first attempt to generate the mammography report using deep-learning.

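Pay particular attention to the attention maps and to the proportion of the vision model; judging from the paper, the results look very good.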

To summarize, we make the following contributions in this paper:
  1. We propose a novel framework for mammography report generation using EfficientNet in the encoder and BERT in the decoder.
  2. We demonstrate that the Transformer-based attention mechanism can combine visual and textual information to localize salient regions on the input mammograms and generate a visually interpretable report.
  3. We conduct doctor evaluation and extensive experiments with automatic metrics to show the effectiveness of the proposed framework.
  4. We conduct a qualitative analysis including interpretation of image-text attention mappings to demonstrate how the model is able to generate mammography reports in a meaningful way.

3. Related work

The task of image captioning is creating a model that given a previously unseen query image generates a caption that is both grammatically and semantically correct. The main approaches to image captioning are retrieval-based, template-based and novel caption generation.

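Summary of approaches
  1. Retrieval-based: this approach searches a predefined database for the description that best matches the current image. The descriptions in the database were written by humans for various images. Given a new image, the system tries to find the most similar image (or set of images) and reuses its description for the new image. The advantage is that the generated text is of high quality, since every description was written by a human. The drawbacks are that it is hard to extend to new, unseen images, and that finding a closely matching image in the database can be difficult.
  2. Template-based: this approach generates descriptions from predefined templates that contain variable slots, which are filled dynamically according to the image content. For example, a template might be "This is a photo of [object] in [scene]", where "[object]" and "[scene]" are filled in from the image recognition results. The advantages are that it is easy to implement and understand, and the generated text is usually grammatical. The drawback is that the descriptions may lack diversity and creativity, because they are all produced from fixed templates.
  3. Novel caption generation: this approach uses deep learning models, such as a convolutional neural network (CNN) combined with a recurrent neural network (RNN) or a Transformer, to generate new descriptions directly from the image. It does not rely on predefined templates or a database; instead, the model learns from a large dataset of images and their descriptions how to describe image content. The advantage is that it can produce diverse and rich descriptions and can be applied to unseen images. The challenge is that it needs a large amount of annotated data for training, and the computational cost of training is high.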
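The first two approaches in the list above are simple enough to sketch in a few lines. The following is a toy illustration only: the three-dimensional "image features", the captions in `DATABASE`, and the function names are invented for the example and do not come from the paper or from any captioning library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical database of (image feature vector, human-written caption).
DATABASE = [
    ([1.0, 0.0, 0.0], "a dog running on grass"),
    ([0.0, 1.0, 0.0], "a plate of food on a table"),
    ([0.0, 0.0, 1.0], "a boat on a lake"),
]

def retrieve_caption(query_features):
    """Retrieval-based: reuse the caption of the most similar stored image."""
    _, caption = max(DATABASE, key=lambda item: cosine(query_features, item[0]))
    return caption

def template_caption(detected_object, detected_scene):
    """Template-based: fill the slots of one fixed sentence pattern."""
    return f"This is a photo of {detected_object} in {detected_scene}."

print(retrieve_caption([0.9, 0.1, 0.0]))
print(template_caption("a dog", "a park"))
```

The retrieval function can only ever return one of the three stored captions, and the template function always returns a sentence of the same shape, which is exactly the limitation each approach is criticised for.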

In retrieval-based methods (Hodosh et al., 2013), (Ordonez et al., 2011) candidate captions for query images are selected from a pool of existing captions based on some measure of similarity. The downside of this approach is the inability to generate novel image-specific captions.

In template-based methods (Farhadi et al., 2010), (Kulkarni et al., 2013), (Li et al., 2011) image captions are generated by filling the blanks in fixed templates. These methods can generate grammatically and semantically correct novel captions not present in the training set but cannot generate variable-length captions.

Novel caption generation methods (Xu et al., 2015), (Yao et al., 2017), (You et al., 2016) use a representation of the query image as an input for a language model responsible for generating the captions. This approach follows the encoder-decoder architecture first applied to machine translation tasks (Cho et al., 2014).

To generate an image caption, a representation of the image must first be constructed either via generating handcrafted features or extracting such features automatically, for example using deep neural networks. Examples of hand-crafted features are local binary patterns (Ojala et al., 2002), scale-invariant keypoints (Lowe, 2004), or histograms of oriented gradients (Dalal and Triggs, 2005). Automatic feature extraction from images is commonly used by applying convolutional neural networks (CNN) (LeCun et al., 1998) to the query image. These features may be further enhanced, for example by using a spatial Transformer (Pedersoli et al., 2017).

A sub-field of image captioning is diagnostic captioning (DC). Diagnostic captioning is automatic generation of diagnostic text based on a set of medical images of a patient. DC systems can increase the speed of producing a report for experienced physicians and decrease the number of diagnostic errors for inexperienced doctors (for a recent survey on DC methods see (Pavlopoulos et al., 2021)). The majority of the work in DC is done using encoder-decoder architecture. In addition to evaluation of grammatical and semantical correctness of captions, which is commonly assessed by calculating lexical overlap between generated captions and ground truth (Pavlopoulos et al., 2019), DC quality can be assessed by clinical correctness by conducting clinical experiments with physicians evaluating the generated reports (Zhang et al., 2019), (Liu et al., 2019).

Language models commonly used in DC usually apply recurrent neural networks (RNN) such as LSTM (Hochreiter and Schmidhuber, 1997), see (Vinyals et al., 2015), (Xu et al., 2015), with works using Transformer-based models beginning to appear (Chen et al., 2020). A common approach in DC is the use of 'visual attention' that allows the decoder to focus on particular areas of input images when generating the captions (Jing et al., 2017), (Yuan et al., 2019). Such mechanisms also can be used to highlight the regions of interest on the input images adding to the interpretability of the models (Zhang et al., 2017).

(Chen et al., 2020): Zhihong Chen
(Jing et al., 2017): Baoyu Jing
(Yuan et al., 2019): Jianbo Yuan

We split the dataset into the training, validation and test subsets in the proportion of 91%, 4% and 5% respectively (having 22463, 934 and 1229 cases in each subset). The splits are the same for encoder
Here a single case may carry several labels, so the label counts do not add up to the total number of cases.
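The dataset split is done carefully and is worth learning from. As for how exactly it was produced, they presumably used some method that keeps the classes as balanced as possible across the subsets.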
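The split figures quoted above can be checked with a few lines of arithmetic. This only verifies that the three case counts are consistent with the stated 91% / 4% / 5% proportions; it says nothing about how the cases were assigned to subsets.

```python
# Case counts per subset, as quoted from the paper.
split_sizes = {"train": 22463, "validation": 934, "test": 1229}

total = sum(split_sizes.values())

# Percentage of all cases in each subset, rounded to the nearest integer.
percentages = {name: round(100 * n / total) for name, n in split_sizes.items()}

print(total)        # total number of cases
print(percentages)  # rounded share of each subset
```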

We use a deep multi-view (N Wu, 2019) CNN based on EfficientNet B0 (M Tan, 2019). We chose EfficientNet B0 because it is relatively lightweight and fits in GPU memory when using high resolution images. We have one EfficientNet instance for all views (R-CC, L-CC, R-MLO, L-MLO), i.e. model weights are shared. The first convolutional layer is replaced to accept a one-channel image. The last fully-connected layer of EfficientNet is discarded. Outputs from all four views are averaged by channels and one fully connected layer is added.

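The model design broadly matches what I had expected, though there are also quite a few differences; it offers a good line of thought.
  1. The first convolutional layer does not use a very large kernel but a normal one, and it takes 1 input channel instead of 3.
  2. All images pass through the same visual encoder, which is what sharing weights means.
  3. The outputs of the four views are averaged to obtain the final output features.
  4. The last fully connected layer is removed.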
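The data flow described in the points above (one shared encoder, channel-wise averaging over the four views, then one added fully connected layer) can be sketched without any deep learning framework. Everything here is a stand-in: `shared_encoder` is a plain linear map rather than EfficientNet B0, the images are three-pixel lists, and all weights are invented so that the arithmetic is easy to follow.

```python
VIEW_NAMES = ("R-CC", "L-CC", "R-MLO", "L-MLO")

def shared_encoder(image, weights):
    """Stand-in for the shared CNN: maps a flat one-channel image to channel features."""
    return [sum(w * px for w, px in zip(row, image)) for row in weights]

def average_views(per_view_features):
    """Average the feature vectors of all views, channel by channel."""
    n = len(per_view_features)
    return [sum(channel) / n for channel in zip(*per_view_features)]

def fully_connected(features, fc_weights, fc_bias):
    """The single fully connected layer added after averaging."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(fc_weights, fc_bias)]

def encode_study(views, weights, fc_weights, fc_bias):
    # The same `weights` object is used for every view: this is the weight sharing.
    per_view = [shared_encoder(views[name], weights) for name in VIEW_NAMES]
    return fully_connected(average_views(per_view), fc_weights, fc_bias)

# Hypothetical toy study: four "images" of three pixels each.
views = {
    "R-CC": [1.0, 2.0, 3.0],
    "L-CC": [3.0, 2.0, 1.0],
    "R-MLO": [0.0, 0.0, 0.0],
    "L-MLO": [4.0, 0.0, 4.0],
}
encoder_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # 3 pixels -> 2 channels
fc_weights = [[1.0, 1.0]]                              # 2 channels -> 1 output
fc_bias = [0.5]

print(encode_study(views, encoder_weights, fc_weights, fc_bias))
```

In a real implementation the linear map would be replaced by the convolutional backbone with its first layer changed to one input channel and its final classification layer dropped, as the paper describes.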

The encoder is pretrained to predict multilabel targets important for diagnosis in mammography screening, shown in Table 1. The binary targets were extracted with regular expressions from text descriptions of the studies. Targets № 0-4 are typical pathological changes in breasts tissues. During training, the images are cropped and resized to 1350x900 px.
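As a rough idea of what extracting binary targets with regular expressions might look like, here is a small sketch. The label names and the English patterns in `LABEL_PATTERNS` are hypothetical; the actual expressions, the report language, and the exact target list of Table 1 are not reproduced here.

```python
import re

# Hypothetical patterns for illustration only.
LABEL_PATTERNS = {
    "mass": re.compile(r"\bmass(es)?\b", re.IGNORECASE),
    "calcification": re.compile(r"\bcalcifications?\b", re.IGNORECASE),
    "asymmetry": re.compile(r"\basymmetr(y|ies)\b", re.IGNORECASE),
}

def extract_targets(report_text):
    """Return a binary multilabel target: 1 if the pattern occurs in the report, else 0."""
    return {label: int(bool(pattern.search(report_text)))
            for label, pattern in LABEL_PATTERNS.items()}

print(extract_targets("Scattered calcifications in the left breast."))
```

A plain keyword match like this also fires on negated statements such as "no mass identified", so a practical rule set would need extra handling for negation.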

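Their dataset processing is better than mine, and their vision model is most likely better than the one I designed myself. For the pretraining step, however, more options are available now, since models such as CLIP and GLoRIA can be used for pretraining.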

Skipping the model architecture for now, let's look at the evaluation section first.

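I think this shows a good paradigm for evaluating a model: the random baseline gives a clear reference point for judging the model's performance, and besides the BLEU, METEOR, and ROUGE-L metrics I had seen before, it also shows that the CIDEr metric can be used.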

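Evaluation metrics:
  1. BLEU (Bilingual Evaluation Understudy): one of the most widely used metrics for machine translation quality. It evaluates a translation by computing the n-gram overlap between the machine output and a set of human reference translations. A higher BLEU score means more overlap between the output and the references, which is usually taken to indicate better quality.

  2. METEOR (Metric for Evaluation of Translation with Explicit Ordering): an improvement over BLEU that considers not only exact word matches but also matches on word stems, synonyms, and word order. METEOR also weights the matched words, giving different importance to different kinds of match (such as stem matches or synonym matches).

  3. ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence): ROUGE is mainly used to evaluate automatic summarization or machine translation. The "L" stands for longest common subsequence (LCS): the metric measures the similarity of the candidate and reference texts through their longest common subsequence, which takes word order into account.

  4. CIDEr (Consensus-based Image Description Evaluation): a metric designed specifically for image description tasks. It measures quality through the n-gram similarity between the candidate description and a set of reference descriptions. CIDEr puts particular emphasis on distinctive vocabulary: it uses TF-IDF statistics to give rare n-grams more weight, which encourages descriptions that reflect the specific and distinctive content of the image.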
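Two of the building blocks behind the metrics above are short enough to write out: the clipped n-gram precision used inside BLEU and an LCS-based ROUGE-L score. This is a simplified sketch for intuition, not a replacement for a standard evaluation toolkit: it omits BLEU's brevity penalty and geometric mean over n-gram orders, and it uses a plain F1 instead of ROUGE-L's weighted F-measure. The example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L simplified to an unweighted F1 of LCS precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

candidate = "no mass in the left breast".split()
reference = "no suspicious mass in the left breast".split()

print(clipped_precision(candidate, reference, n=1))
print(clipped_precision(candidate, reference, n=2))
print(rouge_l_f1(candidate, reference))
```

The candidate drops the word "suspicious": unigram precision stays perfect, bigram precision falls because "no mass" is not a reference bigram, and ROUGE-L is penalised only through recall.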


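They go even further and use evaluation metrics that are specific to individual clinical findings. This is an excellent idea and well worth learning from.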
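One common way to build a finding-level metric is to extract binary labels from both the generated and the ground-truth report and then compute precision, recall, and F1 per finding. Whether the paper computes its clinical metrics exactly this way is an assumption on my part; the function name and the toy labels below are invented for illustration.

```python
def per_finding_scores(true_labels, predicted_labels, finding):
    """Precision, recall, and F1 for one finding across a list of cases."""
    pairs = [(t[finding], p[finding]) for t, p in zip(true_labels, predicted_labels)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: from the reference reports and from the generated reports.
true_labels = [{"mass": 1}, {"mass": 0}, {"mass": 1}, {"mass": 0}]
predicted_labels = [{"mass": 1}, {"mass": 1}, {"mass": 0}, {"mass": 0}]

print(per_finding_scores(true_labels, predicted_labels, "mass"))
```

Unlike BLEU or ROUGE-L, a score like this does not care how the report is worded, only whether the finding it mentions agrees with the reference.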


Figures presented like this are really excellent and well worth learning from; the AI work coming out of Russia is really good.

The model architecture is analysed in Part 2; here we jump to the conclusion.

In this paper we present a first-of-its-kind framework for generating mammography reports given four mammography views using deep-learning. Our model utilizes pretrained models including EfficientNet for visual extraction and BERT for report generation. We demonstrate that the Transformer-based attention mechanism that simultaneously attends to four mammography views and text from the report significantly improves the performance. Our method provides a novel perspective for breast screening: generating mammography reports and providing image-text attention mappings, which makes the automatic breast screening process semantically and visually interpretable. The validity of our approach is confirmed by the corresponding doctor evaluation. In the conducted qualitative analysis we demonstrate that our best model successfully detects pathological regions, and describes abnormalities and parts of the breast.
