Re-identification from histopathology images





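What most deep learning (DL) methods have in common is that they require large amounts of data for training. This has motivated the release of large-scale histopathology datasets to accelerate DL research, such as the CAMELYON dataset (Litjens et al., 2018), the breast invasive carcinoma dataset of The Cancer Genome Atlas (TCGA) (Lingle et al., 2016), and the Tumor Proliferation Assessment Challenge 2016 dataset (Veta et al., 2019).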

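In clinical practice, whole slide images (WSIs) are usually accompanied by private patient information such as name, age, and sex. This information is classified as protected health information (PHI) and is protected by government regulations in most countries, for example the Health Insurance Portability and Accountability Act (HIPAA; Centers for Disease Control and Prevention, 2023) in the United States or the General Data Protection Regulation (GDPR; European Union, 2023) in Europe. Data anonymization is therefore a critical step when publishing medical data (Willemink et al., 2020; Moore et al., 2015; Bisson et al., 2023). For WSIs, this means removing PHI from every place where it may occur, such as file names or slide labels. In addition, the metadata of the slide files must be checked carefully (Clark et al., 2013).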






In numerous studies, deep learning algorithms have proven their potential for the analysis of histopathology images, for example, for revealing the subtypes of tumors or the primary origin of metastases. These models require large datasets for training, which must be anonymized to prevent possible patient identity leaks. This study demonstrates that even relatively simple deep learning algorithms can re-identify patients in large histopathology datasets with substantial accuracy. In addition, we compared a comprehensive set of state-of-the-art whole slide image classifiers and feature extractors for the given task. We evaluated our algorithms on two TCIA datasets including lung squamous cell carcinoma (LSCC) and lung adenocarcinoma (LUAD). We also demonstrate the algorithm's performance on an in-house dataset of meningioma tissue. We predicted the source patient of a slide with 𝐹1 scores of up to 80.1% and 77.19% on the LSCC and LUAD datasets, respectively, and with 77.09% on our meningioma dataset. Based on our findings, we formulated a risk assessment scheme to estimate the risk to the patient's privacy prior to publication.




In this study, we used three distinct datasets, two of which are publicly available. These two datasets, namely lung adenocarcinoma (LUAD) (National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC), 2018a) and lung squamous cell carcinoma (LSCC) (National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC), 2018b), were obtained from TCIA (Clark et al., 2013). In the remainder of this paper, they are referred to as the LUAD dataset and the LSCC dataset. Both datasets were scanned at a resolution of 0.5 μm per pixel and were obtained from various pathology centers. We restricted our analysis to patients for whom at least two slides were available, resulting in 1059 images of 226 patients for the LUAD dataset and 1071 images of 209 patients for the LSCC dataset.
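The restriction to patients with at least two slides can be sketched as a simple grouping step. This is a minimal illustration, not the authors' code: the function name `filter_patients_with_min_slides`, the slide-to-patient mapping, and the toy identifiers are all hypothetical.

```python
from collections import defaultdict


def filter_patients_with_min_slides(slide_to_patient, min_slides=2):
    """Group slides by patient and keep patients with >= min_slides slides.

    slide_to_patient: dict mapping a slide ID to its patient ID
    (hypothetical input format, assumed for this sketch).
    Returns a dict mapping patient ID -> sorted list of slide IDs.
    """
    slides_per_patient = defaultdict(list)
    for slide_id, patient_id in slide_to_patient.items():
        slides_per_patient[patient_id].append(slide_id)
    return {
        patient_id: sorted(slides)
        for patient_id, slides in slides_per_patient.items()
        if len(slides) >= min_slides
    }


# Toy example with made-up identifiers: patient P2 has only one slide.
example = {"s1": "P1", "s2": "P1", "s3": "P2", "s4": "P3", "s5": "P3", "s6": "P3"}
kept = filter_patients_with_min_slides(example)
num_patients = len(kept)
num_slides = sum(len(slides) for slides in kept.values())
```

A patient with a single slide cannot appear in both a training and a test split, which is presumably why such patients are excluded.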




This work demonstrates that re-identification of patients from histopathology images of resected tumor specimens is possible, with some limitations. As long as the slides originate from the same tumor, we can re-identify the patients with considerable accuracy (as can be seen in Tables 1 and 2). If the slides were resected at different points in time, the accuracy is considerably lower (see Table 3). A successful resection should completely remove the tumor, and hence a later resection represents either a regrowth after an incomplete resection or a new tumor of potentially different pathogenesis and mutational pattern. Our results indicate that the strong performance drop could be linked to different morphological tumor characteristics. Consequently, our approach is more likely to identify tumors than patients.

Which visual factors in particular contribute to the re-identification is a question for future research. However, even if the models relied heavily on traces of slide preparation to re-identify the slides, this would threaten patient privacy. Therefore, if these factors imprint some kind of implicit visual time stamp, future work can focus on how to remove these traces from the slides.

Our results indicate that the safest way of publishing histopathology images is to use each patient in only one data publication, as tracing across datasets, and hence recombination of multiple metadata and image datasets, is feasible, especially if slides originating from the same tumor are used in different datasets.






For all experiments, we report the recall@1, the recall@5, the precision, and the 𝐹1 score. These values are always averaged over all classes. For recall@n, a prediction is considered correct if the searched patient is among the 𝑛 patients with the highest-ranked predictions based on the classification score. In a multiclass classification problem, the average recall@1 equals the balanced accuracy. For comparison, the probability of selecting the right patient by chance when assessing 𝑁 patients is also given for each dataset.
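The class-averaged recall@n described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the per-sample score dictionaries, and the toy scores are hypothetical, and the chance level is assumed to be 1/𝑁 for uniform guessing of a single patient.

```python
from collections import defaultdict


def mean_recall_at_n(scores, true_labels, n):
    """Recall@n averaged over classes (patients).

    scores: list of dicts, one per test sample, mapping patient ID -> score
    (hypothetical input format). A sample counts as a hit if its true
    patient is among the n highest-scoring patients.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample_scores, true_label in zip(scores, true_labels):
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        totals[true_label] += 1
        if true_label in ranked[:n]:
            hits[true_label] += 1
    per_class_recall = [hits[label] / totals[label] for label in totals]
    return sum(per_class_recall) / len(per_class_recall)


def chance_level(num_patients):
    """Probability of guessing the right patient uniformly at random (assumed 1/N)."""
    return 1.0 / num_patients


# Toy example with three patients and four test slides (made-up scores).
scores = [
    {"A": 0.70, "B": 0.20, "C": 0.10},
    {"A": 0.50, "B": 0.30, "C": 0.20},
    {"A": 0.10, "B": 0.30, "C": 0.60},
    {"A": 0.40, "B": 0.35, "C": 0.25},
]
true_labels = ["A", "B", "C", "C"]
recall_at_1 = mean_recall_at_n(scores, true_labels, n=1)
recall_at_2 = mean_recall_at_n(scores, true_labels, n=2)
```

Because the recall is computed per patient first and averaged afterwards, patients with many slides do not dominate the result, which is what makes the average recall@1 coincide with the balanced accuracy.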

5.1. Results of experiment 1

On all three datasets, the methods demonstrated satisfactory performance in re-identifying the patients based on histology slides. Detailed results can be found in Table 1. On the MEN dataset, the TransMIL classifier in conjunction with the UNI feature extractor achieved the highest recall@1 with 77.09%. The comparison of different feature encoders for projecting WSI patches into latent space revealed that the UNI model yielded the most favorable results across all compared WSI classifiers. In general, each of the feature extractors pre-trained on histopathology data outperformed the ImageNet baseline. We also found satisfactory performance on the two public datasets. Using the TransMIL approach with the UNI feature extractor, we achieved a recall@1 of 77.19% on the LUAD dataset and of 80.01% on the LSCC dataset.

The results of Experiment 1 with stain augmentation are presented in Table 2. Compared to the results without augmentation presented in Table 1, the performance of all compared models decreased with the use of stain augmentation. The drop was most pronounced for the patch-based and naive-MIL approaches, with recall@1 decreasing from 74.92% to 65.25% and from 74.98% to 61.13%, respectively. In contrast, the drop in performance was less pronounced for the WSI classifiers. Of these, the CLAM model exhibited the greatest drop in recall@1, with a decrease from 66.54% to 62.97%, while the TransMIL approach maintained a high recall@1 value of 76.14%.

5.2. Results of experiment 2

When the models were trained on the earliest resection and tested on later resections of the MEN dataset, the performance dropped remarkably compared to Experiment 1 (see Table 3). The highest performance among the compared methods was observed for the TransMIL method, with a recall@1 of 15.10% and a recall@5 of 29.25%. Even though the individual results were lower than in Experiment 1, they all remained considerably above the respective probabilities of random guessing.

5.3. Results of the post hoc analysis of experiments 1 and 2

Fig. 6 illustrates the L2-distance between latent space embeddings of test samples and their corresponding latent space anchors for Experiments 1 and 2. In both experiments, the correctly classified samples were observed to have a considerably smaller distance to their respective latent space anchors than the misclassified samples. Additionally, the distances between test samples and their latent space anchors were considerably smaller in Experiment 1 than in Experiment 2.
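The distance analysis above can be sketched as follows. This excerpt does not define how the latent space anchors are obtained, so the sketch assumes, purely for illustration, that a patient's anchor is the mean of that patient's training embeddings. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math
from collections import defaultdict


def compute_anchors(train_embeddings, train_labels):
    """Assumed anchor definition: per-patient mean of the training embeddings."""
    sums = {}
    counts = defaultdict(int)
    for embedding, label in zip(train_embeddings, train_labels):
        if label not in sums:
            sums[label] = [0.0] * len(embedding)
        sums[label] = [s + e for s, e in zip(sums[label], embedding)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def distance_to_own_anchor(embedding, label, anchors):
    """L2 (Euclidean) distance between a test embedding and its patient's anchor."""
    return math.dist(embedding, anchors[label])


# Toy 2-D embeddings (made up); real slide embeddings have many more dimensions.
train_embeddings = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [0.0, 6.0]]
train_labels = ["P1", "P1", "P2", "P2"]
anchors = compute_anchors(train_embeddings, train_labels)
d = distance_to_own_anchor([1.0, 3.0], "P1", anchors)
```

Grouping these distances by whether the sample was classified correctly, and by experiment, would give the kind of comparison reported for Fig. 6.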





Fig. 1. Overview of randomly selected patches from the three datasets used. In contrast to our in-house meningioma dataset (MEN), the lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LSCC) datasets originating from TCIA exhibit a more pronounced visual variance. Each patch covers an area of about 0.012 square millimeters.


Fig. 2. Scheme of the tissue preparation procedure used to prepare the slides in the in-house meningioma (MEN) dataset. A resection can be divided into one or more containers, each of which can be further divided into one or more blocks. However, only one slide from each block is included in the dataset.



Fig. 3. Scheme of how the online stain augmentation was applied in the naive-MIL model. During training, each of the images within one bag was augmented separately.



Fig. 4. Versions of the same patch to which different intensities of stain augmentation were applied. A stain augmentation based on Macenko's stain normalization method was used. The non-augmented patch is shown in the center of the grid.



Fig. 5. Overview of the experimental setup of Experiments 1 and 2. Experiment 1 involved a tenfold Monte Carlo cross-validation. In Experiment 2, the slides from the earliest resection were used for training, while all images from later resections were used in a hold-out test dataset. To increase the statistical validity of the results of Experiment 2, ten models for each algorithm were trained on ten randomly selected training and validation splits drawn from the earliest resection of each patient.
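The tenfold Monte Carlo cross-validation of Experiment 1 can be sketched as repeated random splits of each patient's slides. This is a minimal illustration, not the authors' pipeline: the function name, the `test_fraction` of 0.2, the fixed seed, and the toy slide identifiers are assumptions, since this excerpt does not state the split ratios.

```python
import random


def monte_carlo_splits(slides_per_patient, n_repeats=10, test_fraction=0.2, seed=0):
    """Draw n_repeats random train/test splits of each patient's slides.

    Every patient keeps at least one slide in the training set and one in
    the test set, which requires at least two slides per patient.
    test_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        train, test = [], []
        for patient_id, slides in slides_per_patient.items():
            shuffled = list(slides)
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            n_test = min(n_test, len(shuffled) - 1)
            test += [(slide, patient_id) for slide in shuffled[:n_test]]
            train += [(slide, patient_id) for slide in shuffled[n_test:]]
        splits.append((train, test))
    return splits


# Toy example with made-up slide identifiers.
slides_per_patient = {"P1": ["a", "b"], "P2": ["c", "d", "e", "f", "g"]}
splits = monte_carlo_splits(slides_per_patient)
```

Unlike k-fold cross-validation, the test sets of different repetitions may overlap, because each repetition draws a fresh random split.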



Fig. 6. Distances between test samples and their respective latent space anchors. Subfigure (a) shows the distances for Experiment 1 and subfigure (b) shows the distances for Experiment 2. In general, correctly classified samples are closer to their respective latent space anchors.



Fig. 7. Risk assessment scheme for estimating patient privacy risks when publishing histopathology images.




Table 1. Results of Experiment 1. The respective means and standard deviations of the tenfold Monte Carlo cross-validation are given. In a multiclass classification problem, the mean recall is equal to the balanced accuracy. Random probability is the probability of selecting the correct patient by random guessing.



Table 2. Results of Experiment 1 while using strong stain augmentation during training. The respective means and standard deviations of the tenfold Monte Carlo cross-validation are given. In a multiclass classification problem, the mean recall is equal to the balanced accuracy. Random probability is the probability of selecting the correct patient by random guessing.



Table 3. Results of Experiment 2. In a multiclass classification problem, the balanced accuracy equals the average recall. Random probability is the probability of selecting the correct patient by random guessing.



Table A.1. Results of the preliminary investigation of the optimal magnification level for patch sampling. Given are the results of the tenfold Monte Carlo cross-validation using the MEN dataset and the patch-based model. In each experiment, patches with a width and height of 512 pixels were used. The spatial resolution is given in microns per pixel (mpp).





