【404文库】协和医学院规培医生董袭莹博士论文, 中国知网已删除

CDT编辑注:日前,中日友好医院医生肖飞被其妻举报婚外情,其中一位当事人为规培医师、协和医学院博士生董袭莹。举报信中还指出,肖飞曾在手术过程中与护士发生冲突,期间不顾患者安危,与董袭莹一同离开手术室。随后,董袭莹的教育背景引发网络关注。公开信息显示,董袭莹本科就读于美国哥伦比亚大学下属的女子学院——巴纳德学院,主修经济学,2019年被北京协和医学院“4+4”临床医学长学制试点班录取为博士研究生。相比中国传统医学教育体系通常更长的教学与规培周期,协和“4+4”模式在公平性与专业性方面引发质疑。与此同时,有网民指出,董袭莹的博士论文与北京科技大学的一项发明专利存在多处雷同,涉嫌学术不端。事件持续发酵后,中国知网已将其博士论文下架,而这篇论文的下架流程是否遵循既定的撤稿标准亦受到质疑。有网友在此之前保存了论文PDF,并上传至GitHub。以下文字和图片由CDT通过PDF版转录存档。
北京协和医学院临床医学专业毕业论文
学校代码:10023
学 号: B2019012012
跨模态图像融合技术在医疗影像分析中的研究
专业年级:北京协和医学院临床医学专业 2019 级试点班
姓名: 董袭莹
导师: 邱贵兴(教授)
北京协和医学院 临床学院(北京协和医院)
骨科
完成日期:2023年5月
目录
摘要
Abstract
1. 基于特征匹配的跨模态图像融合的宫颈癌病变区域检测
1.1. 前言
1.2. 研究方法
1.2.1. 研究设计和工作流程
1.2.2. 跨模态图像融合
1.2.3. 宫颈癌病变区域检测
1.3. 实验
1.3.1. 临床信息和影像数据集
1.3.2. 模型训练过程
1.3.3. 评价指标
1.3.4. 目标检测模型的结果与分析
1.4. 讨论
1.5. 结论
2. 基于特征转换的跨模态数据融合的乳腺癌骨转移的诊断
2.1. 前言
2.2. 研究方法
2.2.1. 研究设计和工作流程
2.2.2. 骨转移目标区域检测
2.2.3. 基于特征转换的跨模态数据融合
2.2.4. 乳腺癌骨转移的分类模型
2.3. 实验
2.3.1. 临床信息和影像数据集
2.3.2. 模型训练过程
2.3.3. 评价指标
2.3.4. 单模态骨转移灶检测模型及基于特征转换的跨模态分类模型的结果与分析
2.4. 讨论
2.5. 结论
全文小结
参考文献
缩略词表
文献综述
3. 跨模态深度学习技术在临床影像中的应用
3.1 Preface
3.2. Deep Neural Network (DNN)
3.2.1. Supervised learning
3.2.2. Backpropagation
3.2.3. Convolutional neural networks (CNN)
3.3. Cross-modal fusion
3.3.1. Cross-modal fusion methods
3.3.2. Cross-modal image translation
3.4. The application of cross-modal deep learning
3.5. Conclusion
参考文献
致谢
独创性声明
学位论文版权使用授权书
摘要
背景
影像学检查是医疗领域最常用的筛查手段,据统计,医疗数据总量中有超过90%是由影像数据构成[1]。然而,根据亲身参与的临床病例[2]可知,很多情况下,仅凭医生的肉眼观察和主观诊断经验,不足以对影像学异常作一明确判断。而诊断不明引起的频繁就医、贻误病情,则会严重影响患者的生活质量。
相较于传统的主观阅片,人工智能技术通过深度神经网络分析大量影像和诊断数据,学习对病理诊断有用的特征,在客观数据的支持下做出更准确的判断。为了模拟临床医生结合各种成像模式(如 CT、MRI 和 PET)形成诊断的过程,本项目采用跨模态深度学习方法,将各种影像学模态特征进行有机结合,充分利用其各自的独特优势训练深度神经网络,以提高模型性能。鉴于肿瘤相关的影像学资料相对丰富,本项目以宫颈癌和乳腺癌骨转移为例,测试了跨模态深度学习方法在病变区域定位和辅助诊断方面的性能,以解决临床实际问题。
方法
第一部分回顾性纳入了220 例有FDG-PET/CT 数据的宫颈癌患者,共计72,602张切片图像。应用多种图像预处理策略对PET 和CT 图像进行图像增强,并进行感兴趣区域边缘检测、自适应定位和跨模态图像对齐。将对齐后的图像在通道上级联输入目标检测网络进行检测、分析及结果评估。通过与使用单一模态图像及其他 PET-CT 融合方法进行比较,验证本项目提出的 PET-CT 自适应区域特征融合结果在提高模型目标检测性能方面具有显著性优势。第二部分回顾性纳入了233 例乳腺癌患者,每例样本包含 CT、MRI、或 PET 一至三种模态的全身影像数据,共有3051 张CT 切片,3543 张MRI 切片,1818 张PET 切片。首先训练YOLOv5 对每种单一模态图像中的骨转移病灶进行目标检测。根据检测框的置信度划分八个区间,统计每个影像序列不同置信度区间中含有检出骨转移病灶的个数,并以此归一化后作为结构化医疗特征数据,采用级联方式融合三种模态的结构化特征实现跨模态特征融合。再用多种分类模型对结构化数据进行分类和评估。将基于特征转换的跨模态融合数据与特征转换后的单模态结构化数据,以及基于 C3D 分类模型的前融合方式进行比较,验证第二部分提出的方法在乳腺癌骨转移诊断任务中的优越性能。
结果
第一部分的基于跨模态融合的肿瘤检测实验证明,PET-CT 自适应区域特征融合图像显著提高了宫颈癌病变区域检测的准确性。相比使用CT 或PET 单模态图像以及其他融合方法生成的多模态图像作为网络输入,目标检测的平均精确度分别提高了 6.06%和 8.9%,且消除了一些假阳性结果。上述测试结果在使用不同的目标检测模型的情况下保持一致,这表明自适应跨模态融合方法有良好的通用性,可以泛化应用于各种目标检测模型的预处理阶段。第二部分基于特征转换的跨模态病例分类实验证明,跨模态融合数据显著提高了乳腺癌骨转移诊断任务的性能。相较于单模态数据,跨模态融合数据的平均准确率和AUC分别提高了7.9%和8.5%,观察 ROC 曲线和 PR 曲线的形状和面积也具有相同的实验结论:在不同的分类模型中,使用基于特征转换的跨模态数据,相比单模态数据,对于骨转移病例的分类性能更为优越。而相较于基于 C3D 的前融合分类模型,基于特征转换的后融合策略在分类任务方面的性能更优。
结论
本项目主要包含两个部分。第一部分证实了基于区域特征匹配的跨模态图像融合后的数据集在检测性能上优于单模态医学图像数据集和其他融合方法。第二部分提出了一种基于特征转换的跨模态数据融合方法。使用融合后的数据进行分类任务,其分类性能优于仅使用单模态数据进行分类或使用前融合方法的性能。根据不同模态医学图像的特征差异与互补性,本项目验证了跨模态深度学习技术在病变区域定位和辅助诊断方面的优势。相比于只使用单模态数据进行训练的模型,跨模态深度学习技术有更优的诊断准确率,可以成为有效的临床辅助工具,协助和指导临床决策。
关键词:跨模态融合,深度学习,影像分析,宫颈癌,乳腺癌骨转移
Abstract
Background
Imaging examinations serve as the predominant screening method in the medical field. As statistics reveal, imaging data constitute over 90% of the entire medical dataset. Nonetheless, clinical cases have demonstrated that mere subjective diagnoses by clinicians often fall short in making definitive judgments on imaging anomalies. Misdiagnoses or undiagnosed conditions, which result in frequent hospital visits and delayed treatment, can profoundly affect patients’ quality of life.
Compared to the traditional subjective image interpretation by clinicians, AI leverages deep neural networks to analyze large-scale imaging and diagnostic data, extracting valuable features for pathology diagnosis, and thus facilitating more accurate decision-making, underpinned by objective data. To emulate clinicians’ diagnostic process that integrates various imaging modalities like CT, MRI, and PET, a cross-modal deep learning methodology is employed. This approach synergistically merges features from different imaging modalities, capitalizing on their unique advantages to enhance model performance.
Given the ample availability of oncologic imaging data, the project exemplifies the efficacy of this approach in cervical cancer segmentation and detection of breast cancer bone metastasis, thereby addressing pragmatic challenges in clinical practice.
Methods
The first part retrospectively analyzed 72,602 slices of FDG-PET/CT scans from 220 cervical cancer patients. Various preprocessing strategies were applied to enhance PET and CT images, including edge detection, adaptive ROI localization, and cross-modal image
fusion. The fused images were then concatenated on a channel-wise basis and fed into the object detection network for the precise segmentation of cervical cancer lesions. Compared to single modality images (either CT or PET) and alternative PET-CT fusion techniques,
the proposed method of PET-CT adaptive fusion was found to significantly enhance the object detection performance of the model. The second part of the study retrospectively analyzed 3,051 CT slices, 3,543 MRI slices and 1,818 PET slices from 233 breast cancer patients, with each case containing whole-body imaging of one to three modalities (CT, MRI, or PET). Initially, YOLOv5 was trained to detect bone metastases in images across different modalities. The confidence levels of the prediction boxes were segregated into eight tiers, following which the number of boxes predicting bone metastases in each imaging sequence was tallied within each confidence tier. This count was then normalized and utilized as a structured feature. The structured features from the three modalities were fused in a cascaded manner for cross-modal fusion. Subsequently, a variety of classification models were then employed to evaluate the structured features for diagnosing bone metastasis. In comparison to feature-transformed single-modal data and the C3D early fusion method, the cross-modal fusion data founded on feature transformation demonstrated superior performance in diagnosing breast cancer bone metastasis.
Results
The first part of our study delivered compelling experimental results, showing a significant improvement in the accuracy of cervical cancer segmentation when using adaptively fused PET-CT images. Our approach outperformed other object detection algorithms based on either single-modal images or multimodal images fused by other methods, with an average accuracy improvement of 6.06% and 8.9%, respectively, while also effectively mitigating false-positive results. These promising test results remained consistent across different object detection models, highlighting the robustness and universality of our adaptive fusion method, which can be generalized in the preprocessing stage of diverse object detection models. The second part of our study demonstrated that cross-modal fusion based on feature transformation could significantly improve the performance of bone metastasis classification models. When compared to algorithms employing single-modal data, models based on cross-modal data had an average increase in accuracy and AUC of 7.9% and 8.5%, respectively. This improvement was further corroborated by the shapes of the ROC and PR curves. Across a range of classification models, feature-transformed cross-modal data
consistently outperformed single-modal data in diagnosing breast cancer bone metastasis. Moreover, late fusion strategies grounded in feature transformation exhibited superior performance in classification tasks when juxtaposed with early fusion methods such as C3D.
Conclusions
This project primarily consists of two parts. The first part substantiates that deep learning object detection networks founded on the adaptive cross-modal image fusion method outperform those based on single-modal images or alternative fusion methods. The second part presents a cross-modal fusion approach based on feature transformation. When the fused features are deployed for classification models, they outperform those utilizing solely single-modal data or the early fusion model. In light of the differences and complementarity in the features of various image modalities, this project underscores the strengths of cross-modal deep learning in lesion segmentation and disease classification. When compared to models trained only on single-modal data, cross-modal deep learning offers superior diagnostic accuracy, thereby serving as an effective tool to assist in clinical decision-making.
Keywords: cross-modal fusion, deep learning, image analysis, cervical cancer, breast cancer bone metastasis
1. 基于特征匹配的跨模态图像融合的宫颈癌病变区域检测
1.1. 前言
宫颈癌是女性群体中发病率第四位的癌症,每年影响全球近 50 万女性的生命健康[3] 。准确和及时的识别宫颈癌至关重要,是否能对其进行早期识别决定了治疗方案的选择及预后情况[4]。氟代脱氧葡萄糖正电子发射计算机断层显像/电子计算机断层扫描(fluorodeoxyglucose-positron emission tomography/computed tomography, FDG-PET/CT),因其优越的敏感性和特异性,成为了一个重要的宫颈癌检测方式[5] 。由于CT 能够清晰地显示解剖结构,FDG-PET 能够很好地反映局灶的代谢信息形成功能影像,FDG-PET/CT 融合图像对可疑宫颈癌病灶的显示比单独使用高分辨率 CT 更准确,特别是在检测区域淋巴结受累和盆腔外病变扩展方面[6] ,[7] ,[8] 。然而,用传统方法为单一患者的 FDG-PET/CT 数据进行分析需要阅读数百幅影像,对病变区域进行鉴别分析,这一极为耗时的过程已经妨碍了临床医生对子宫颈癌的临床诊断。
随着计算机硬件和算法的进步,尤其是以深度学习[9]、图像处理技术[10],[11]为代表的机器学习技术的革新,这些人工智能算法在临床医学的许多领域中起着关键作用[12]。基于其强大有效的特征提取能力[13],[14],深度学习中的卷积神经网络可以通过梯度下降自动学习图像中的主要特征[15],极大地提高目标识别的准确性[16],使深度学习成为计算机图像处理领域的主流技术[17],[18]。利用深度学习技术对宫颈癌影像进行分析可以辅助临床医生做出更为准确的判断,减轻临床医生的工作负担,提高诊断的准确性[19]。
目前已经有很多在单一模态图像中(CT 或 PET)基于深度学习技术进行病变检测的工作:Seung等使用机器学习技术依据PET图像预测肺癌组织学亚型[20];Sasank 进行了基于深度学习算法检测头 CT 中关键信息的回顾性研究[21];Chen 使用随机游走(random walk)和深度神经网络对 CT 图像中的肺部病变进行分割[22],[23]。但很少有关于使用跨模态图像融合深度学习方法进行病变检测的研究。
基于 PET/CT 融合图像的病变检测项目包括三个研究任务:区域特征匹配[24],跨模态图像融合[25]和目标病变区域检测[26]。Mattes 使用互信息作为相似性标准,提出了一种三维PET 向胸部CT 配准的区域特征匹配算法[27]。Maqsood 提出了一种基于双尺度图像分解和稀疏表示的跨模态图像融合方案[28]。Elakkiya 利用更快的基于区域的卷积神经网络(Faster Region-Based Convolutional Neural Network, FR-CNN)进行宫颈病变区域的检测[29]。目前还没有将上述三个研究任务,即区域特征匹配、跨模态图像融合、病变区域检测任务,结合起来的研究工作。
为了减轻临床医生的工作负担,基于跨模态深度学习方法,本项目的第一部分提出了一个统一的多模态图像融合和目标检测框架,用于宫颈癌的检测。
1.2. 研究方法
1.2.1. 研究设计和工作流程
本项目旨在检测 CT 和 PET 图像中宫颈癌的病变区域,工作流程如图 1-1 所示:扫描设备对每位患者进行PET 和CT 图像序列的采集;通过区域特征匹配和图像融合来合成清晰且信息丰富的跨模态图像融合结果;采用基于深度学习的目标检测方法在融合图像中对可疑宫颈癌的病变区域进行目标检测。在图 1-1 的最后一行中,矩形框出的黄色区域及图中右上角放大的区域中展示了检测出的宫颈癌病变区域。
图 1-1 工作流程
1.2.2. 跨模态图像融合
图 1-2 展示了跨模态图像融合算法的流程图。根据计算发现两种模态图像的比例和位置不同,如仅进行简单的融合会错误地将处于不同位置的组织影像重叠,从而使组织发生错位,定位不准,产生不可接受的误差。因此,第一部分提出了一种跨模态图像融合策略,其中的步骤包括对感兴趣区域(region of interest, ROI)的自适应定位和图像融合。
在PET 和CT 图像中,自适应ROI定位能够精准识别待分析处理的关键目标,即人体组织影像,然后计算不同模态图像下组织影像之间的比例和位移。依据上述计算结果通过缩放、填充和裁剪的方式来融合 PET 和CT 图像。
图 1-2 CT 和PET 跨模态图像融合算法的流程图
1.2.2.1. 自适应ROI 定位
鉴于数据集中 PET 图像与 CT 图像的黑色背景均为零像素值填充,ROI 内非零像素值较多,而 ROI 边缘的非零像素值较少,因此,选用线检测方法来标画两种模态图像中的 ROI,最终标划结果如图 1-2 中的绿色线框出的部分所示,这四条线是 ROI 在四个方向上的边界。在不同方向上计算比例尺。在将 PET 图像放大后,根据ROI实现CT 和PET 图像的像素级对齐。裁剪掉多余的区域,并用零像素值来补充空白区域。如图 1- 3(a)所示,线检测方法从中心点出发,向四个方向即上下左右对非零像素值进行遍历,并记录下行或列上的非零像素值的数量。如图1- 3(b)所示,红色箭头代表遍历的方向。在从 ROI 中心向边缘进行遍历时,沿遍历经线上的非零像素值数量逐渐减少,如果某一线上非零像素值的计数低于预设的阈值,那么意味着该线已经触及到 ROI 的边缘,如图 1- 3(c)所示。然而,如果直接对未经预处理的图像应用线检测方法,会因受模糊边缘及其噪声的影响,得到较差的对齐结果,难以设置阈值。因此,需对PET 和CT 图像单独执行图像增强预处理,以优化 ROI 标化结果,改善跨模态融合效果。
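下面给出上述自适应 ROI 定位思路的一个简化示意(基于 NumPy 的 Python 草图;阈值、函数名与变量名均为示例假设,并非论文原始实现):

```python
import numpy as np

def locate_roi(image: np.ndarray, threshold: int = 10):
    """从图像中心出发向上下左右遍历,统计每行/每列的非零像素数,
    当计数低于阈值时认为到达 ROI 边界(简化示意)。"""
    h, w = image.shape
    cy, cx = h // 2, w // 2
    row_counts = np.count_nonzero(image, axis=1)  # 每行非零像素数
    col_counts = np.count_nonzero(image, axis=0)  # 每列非零像素数

    def scan(counts, start, step, limit):
        i = start
        while 0 <= i + step < limit and counts[i + step] >= threshold:
            i += step
        return i

    top = scan(row_counts, cy, -1, h)
    bottom = scan(row_counts, cy, +1, h)
    left = scan(col_counts, cx, -1, w)
    right = scan(col_counts, cx, +1, w)
    return top, bottom, left, right  # ROI 在四个方向上的边界
```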
由于PET 和CT 图像具有不同的纹理特征,应用不同的预处理策略,分别对图像进行增强处理,以强化 ROI 的边缘特性,同时消除噪声产生的干扰,再在两种不同模态图像中进行 ROI 定位,如图1-2 所示。
图 1-3 ROI 检测示意图
CT 图像是用 X 射线对检查部位一定厚度的组织层进行扫描,由探测器接收透过该层面的 X 射线,经数字转换器及计算机处理后形成的解剖学图像。CT 图像通常比 PET 图像更清晰。为了提取 ROI,需利用图像增强技术对 CT 图像进行预处理:首先,通过图像锐化增强边缘特征和灰度跳变部分,使 CT 图像的边缘(即灰度值突变区域)信息更加突出;由于锐化可能导致一定的噪声,再使用高斯模糊滤波器(Gaussian blur)[30]进行图像平滑去噪,将噪声所在像素点处理为周围相邻像素值的加权平均近似值,消除影响成像质量的边缘毛躁;并执行Canny边缘检测(Canny edge detection)[31]来设定阈值并连接边缘,从而在图像中提取目标对象的边缘。尽管Canny边缘检测算法已包含高斯模糊的去噪操作,但实验证实两次高斯模糊后的边缘提取效果更优。在对图像进行锐化处理后,将提取的边缘图像与高斯模糊后的图像进行叠加。具体地,对两个图像中的每个像素直接进行像素值相加,最终得到边缘更清晰且减轻噪声影响的增强后 CT 模态图像。
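上述 CT 图像增强流程(锐化、高斯模糊、Canny 边缘检测、边缘图与模糊图叠加)可用如下 OpenCV 草图示意;卷积核与阈值均为假设参数,并非论文原始设置:

```python
import cv2
import numpy as np

def enhance_ct(ct: np.ndarray) -> np.ndarray:
    """CT 图像增强示意:锐化 -> 高斯模糊去噪 -> Canny 边缘检测 -> 像素相加叠加。"""
    ct = ct.astype(np.uint8)
    # 锐化:突出灰度跳变的边缘区域(3x3 拉普拉斯型核,示例值)
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]], dtype=np.float32)
    sharpened = cv2.filter2D(ct, -1, kernel)
    # 高斯模糊:将噪声像素替换为邻域像素的加权平均
    blurred = cv2.GaussianBlur(sharpened, (5, 5), 0)
    # Canny 边缘检测(其内部会再做一次高斯平滑),阈值为示例值
    edges = cv2.Canny(blurred, 50, 150)
    # 边缘图与模糊后的图像逐像素相加,得到增强后的 CT 图像
    return cv2.add(blurred, edges)
```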
PET 图像是基于间接探测到的由正电子放射性核素发射的 γ 射线,经过计算机进行散射和随机信息的校正,形成的影像,能够显示体内代谢活动的信息。尽管PET 可以显示分子代谢水平,但由于成像原理的差异,PET 图像相较于 CT 图像显得模糊。对PET 的预处理方式与对CT 图像的类似,但省略了高斯模糊处理图像噪声的步骤,因为在锐化 PET 模态图像后产生的噪声较少,为防止有效特征信息的丢失,略过这一环节。
为了将两个模态的图像进行区域特征匹配,使用PET 和CT 图像中的矩形ROI框来计算缩放比例和位移参数,并通过缩放、填充和裁剪操作对PET 和CT 图像中的ROI 进行对齐。
1.2.2.2. 图像融合
CT 和 PET 图像的尺寸分别为 512×512 像素和 128×128 像素,ROI 特征区域位于图像的中心位置。通过缩放、零值填充和剪切,放大PET 图像的尺寸以与CT 图像的尺寸保持一致,并且将两个模态图像之间的 ROI 对齐,以便后续的融合处理。经处理的PET 和CT 图像转化为灰度形式,分别进行加权和图像叠加,将其置于不同通道中,作为网络的输入层。由于 PET 图像能展示体内分子层面的代谢水平,其对于肿瘤检测的敏感性高于CT 图像。因此,本研究的图像融合方法为PET 图像的ROI 分配了更多权重,以提高宫颈癌检测任务的表现。
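按 ROI 框计算缩放比例并进行加权叠加的融合步骤,可示意如下(Python 草图;ROI 以 (top, bottom, left, right) 表示,pet_weight 等参数为假设值,且假定缩放后的 PET 能完整落入 CT 画幅内):

```python
import cv2
import numpy as np

def fuse_pet_ct(ct, pet, ct_roi, pet_roi, pet_weight=0.6):
    """PET-CT 自适应融合示意:按两个 ROI 的尺寸比例放大 PET,
    零值填充对齐到 CT 大小,再按权重叠加(简化草图)。"""
    scale_y = (ct_roi[1] - ct_roi[0]) / (pet_roi[1] - pet_roi[0])
    scale_x = (ct_roi[3] - ct_roi[2]) / (pet_roi[3] - pet_roi[2])
    pet_scaled = cv2.resize(pet, None, fx=scale_x, fy=scale_y,
                            interpolation=cv2.INTER_LINEAR)

    # 以零像素填充到与 CT 相同大小,使两个 ROI 的左上角对齐
    aligned = np.zeros(ct.shape, dtype=np.float32)
    dy = max(0, ct_roi[0] - int(pet_roi[0] * scale_y))
    dx = max(0, ct_roi[2] - int(pet_roi[2] * scale_x))
    h = min(pet_scaled.shape[0], ct.shape[0] - dy)
    w = min(pet_scaled.shape[1], ct.shape[1] - dx)
    aligned[dy:dy + h, dx:dx + w] = pet_scaled[:h, :w]

    # 加权叠加:为 PET 的代谢信息分配更高权重
    fused = (1 - pet_weight) * ct.astype(np.float32) + pet_weight * aligned
    return np.clip(fused, 0, 255).astype(np.uint8)
```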
图 1-4 比较了本项目提出的自适应图像融合的结果和直接融合的结果,选取人体不同部位的 CT、PET 和 PET/CT 图像的融合结果进行展示。第一和第二列分别展示了未经处理的原始 CT 和 PET 图像。简单融合算法仅将两个图像的像素点相加,并未执行特征匹配过程,得到的融合图像无任何实用价值。由于通道拼接融合后的图像转变为高维多模态数据,而非三通道数字图像,因此图 1-4 并未展示通道拼接融合方法所得图像。而本项目提出的自适应图像融合方法实现了跨模态图像的精准融合,可用于进一步的观察和计算。
图 1-4 不同图像融合方式的可视化结果
1.2.3. 宫颈癌病变区域检测
先由两位临床医生对跨模态融合图像中的病变区域进行人工标注,并训练YOLOv5 [32] 目标检测网络来识别融合图像中的病灶区域,如图 1-5 所示。模块骨架用于提取图像的深层特征,为减少通过切片操作进行采样过程中的信息损失,采用聚焦结构,并使用跨阶段局部网络(cross-stage partial network, CSPNet)[33] 来减少模型在推理中所需的计算量。头模块用于执行分类和回归任务,采用特征金字塔网络(feature pyramid network, FPN)和路径聚合网络(path aggregation network, PAN)[34]。
为了提高对极小目标区域的检测效果,输入层采用了mosaic数据增强(mosaic data augmentation)[35] 方法,将四个随机缩放、剪切和随机排列的图像拼接在一起。模块骨架包括 CSPNet 和空间金字塔池化(spatial pyramid pooling, SPP)[36] 操作。输入图像通过三个 CSP 操作和一个 SPP 操作,生成了一个四倍于原始大小的特征图。头模块有三个分支网络,分别接收来自不同层的融合特征、输出各层的边界框回归值和目标类别,最后由头模块合并分支网络的预测结果。
图 1-5 目标检测网络结构
1.3. 实验
1.3.1. 临床信息和影像数据集
本项目选取符合以下条件的患者开展研究:1)于 2010 年 1 月至 2018 年 12月期间在国家癌症中心中国医学科学院肿瘤医院被诊断为原发性宫颈癌的患者 2)有FDG-PET/CT 图像;3)有电子病历记录。总共入组了 220 名患者,共计 72,602 张切片图像,平均每位患者有 330 张切片图像入组实验。其中,CT 切片图像的高度和宽度均为512 像素,而PET 切片图像的高度和宽度均为128 像素,每个模态的数据集都包含了6,378张切片图像,即平均每位患者有29张切片图像,用于训练和测试。在入组进行分析之前,所有患者数据都已去标识化。本研究已获得北京协和医学院国家癌症中心伦理委员会的批准。
该数据集包含220 个患者的全身 CT 和全身 PET 图像数据,因入组的每位患者均确诊为宫颈癌,数据集中各例数据均包含病变区域,如表 1-1 所示。鉴于所有患者的CT 和PET 均在同一时间且使用相同设备采集,因此 CT 和PET 展示的解剖信息与代谢信息来自同一时刻患者身体的同一区域,其特征具有一对一对应且可匹配的特性。根据肿瘤大小、浸润深度、盆腔临近组织侵犯程度、腹盆腔淋巴结转移的情况可将宫颈癌的进展程度进行分期,主要包括四期,每期中又进一步细分为更具体的期别。国际妇产科联盟(International Federation of Gynecology and Obstetrics,
FIGO)于 2018 年 10 月更新了宫颈癌分期系统的最新版本[37]。本项目数据集囊括了 FIGO 分期全部四个期别的宫颈癌影像。为了保持训练和测试的公平性,纳入训练集和测试集的不同期别影像的分布,即不同 FIGO 分期的划分比例,需保持一致,否则可能会导致某些 FIGO 期别的数据集无法进行训练或测试。因此,在保证处于不同期别的患者数据的划分比例的基础上,采用五折交叉验证方法将220 名患者的数据进行五等分,每个部分大约包括了 45 例患者的数据,在每轮验证中随机选择一个部分作为测试集。所有模型都需要进行5 次训练和评估,以获取在测试集上表现出的性能的平均值和标准差。
表 1-1 数据集中的病例数及临床分期
1.3.2. 模型训练过程
在按上述步骤准备好数据集后,首先将图像从 512×512 像素调整为 1024×1024像素,然后使用多种数据增强方法,包括 mosaic 增强[38]、HSV(Hue, Saturation, Value)颜色空间增强[39]、随机图像平移、随机图像缩放和随机图像翻转,增加输入数据集对噪声的鲁棒性。在每次卷积后和激活函数前进行批归一化(Batch Normalization, BN)[40]。所有隐藏层都采用 Sigmoid 加权线性单元(Sigmoid-Weighted Linear Units, SiLU)[41]作为激活函数。训练模型所用的学习率设置为 1e-5,并在起始训练时选择较小的学习率,然后在 5 个轮次(epoch)后使用预设的学习率。每个模型使用PyTorch 框架在4 个Nvidia Tesla V100-SXM2 32G GPU 上进行50个轮次的训练。使用 0.98 的动量和 0.01 的权值衰减通过随机梯度下降法(Stochastic Gradient Descent, SGD)来优化各网络层的权重目标函数。在训练过程中,网络在验证集上达到最小的损失时,选择最佳参数。所有实验中的性能测量都是在采用最优参数设置的模型中对测试集进行测试得到的,详见表 1-2。
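上述优化器与学习率预热设置的一种 PyTorch 配置示意如下(模型仅为占位,预热方式为假设,并非论文原始代码):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # 占位模型,仅作示意

# SGD:动量 0.98,权值衰减 0.01,学习率 1e-5
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5,
                            momentum=0.98, weight_decay=0.01)

warmup_epochs, base_lr = 5, 1e-5
for epoch in range(50):  # 共训练 50 个轮次
    # 前 5 个轮次使用较小学习率预热,之后使用预设学习率(示意实现)
    lr = base_lr * (epoch + 1) / warmup_epochs if epoch < warmup_epochs else base_lr
    for group in optimizer.param_groups:
        group["lr"] = lr
    # ... 此处省略前向传播、损失计算、optimizer.step() 与验证集评估 ...
```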
为了进一步证明本项目所提出的模型的普适性,选择了六个基于深度学习的目标检测模型作为基准,并测试了所有模型在输入不同的图像融合结果时的性能。每个模型的输入完全相同,而唯一的区别是神经网络中的超参数来自每个模型的官方设置,而这些超参数因模型而异。
表 1-2 网络训练的超参数
1.3.3. 评价指标
本项目使用“平均精度50”(Average Precision 50, AP50)来评估目标检测的性能。AP50 是当交并比(Intersection over Union, IOU)阈值为0.5 时的平均精度,如公式3所定义,其中P和R分别是精度(Precision)和召回率(Recall)的缩写。模型的预测结果会有不同的召回率和精度值,这取决于置信度阈值。将召回率作为横轴,精度作为纵轴,可以绘制 PR 曲线,而 AP 是该曲线下的面积。IOU 是将真实标注区域和模型预测区域的重叠部分除以两区域的集合部分(即真实区域和预测区域的并集)得到的结果,如公式4 所示。精度和召回率的计算方式分别在公式1 和2 中列出,其中真正例表示预测为正例的正样本,假正例和假负例代表的概念以此类推。精度表明在模型预测结果里,被判断为正例的样本中有多少实际是正例,而召回率表示实际为正例的样本中多少被预测为正例。表 1-3 记录了图像数据集交叉验证后各个目标检测模型的 AP50 的平均值和方差。
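转录版中公式 1–4 缺失;按文中定义,可重建为如下标准形式(重建稿,仅供参考):

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} && (1)\\
\text{Recall} &= \frac{TP}{TP + FN} && (2)\\
\mathrm{AP} &= \int_{0}^{1} P(R)\,\mathrm{d}R && (3)\\
\mathrm{IOU} &= \frac{\left|A_{\text{pred}} \cap A_{\text{gt}}\right|}{\left|A_{\text{pred}} \cup A_{\text{gt}}\right|} && (4)
\end{aligned}
```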
1.3.4. 目标检测模型的结果与分析
本项目采用不同的目标检测模型,包括单阶段目标检测模型(YOLOv5[32]、RetinaNet[42]、ATSS[43])和二阶段目标检测模型(Faster-RCNN[44]、CascadeRCNN[45]、LibraRCNN[46]),在五折交叉验证下比较了使用 CT 图像、PET 图像、PET-CT 简单融合图像、PET-CT 通道拼接融合图像(concat fusion)和本项目所提出的 PET-CT自适应区域特征融合图像作为输入数据集时,每个模型的目标检测性能。其中,CT和PET是单模态图像,而PET-CT简单融合图像、PET-CT通道拼接图像和PET-CT 自适应区域特征融合图像是跨模态融合图像。简单融合是指将 PET 图像简单地缩放到与 CT 图像相同的大小后进行像素值的叠加,而通道拼接融合是直接将两种模态图像在通道上串联在一起作为网络的输入。
如表 1-3 所示,加粗的数字代表每行中最好的实验结果。与使用单一模态数据进行肿瘤检测模型分析(如只使用CT 或PET 图像)相比,本项目所提出的自适应跨模态图像融合方法在目标检测任务中展现出了更高的检测精度。由于自适应融合方法能够在跨模态融合之前将两种模态图像的信息进行预对齐,对 CT 图像和PET图像的结构特征进行一一配准,因此,与简单融合方法和通道拼接融合方法相比,自适应融合方法的性能最佳。上述针对不同模态图像及使用不同跨模态融合方法作为输入得到的测试性能结果在使用不同的目标检测模型的情况下保持一致,这表明本项目所提出的跨模态自适应融合方法有良好的通用性,可以泛化应用到各种目标检测模型的预处理中。
表 1-3 五折交叉验证目标检测实验的结果(“*” 表示交叉验证中的某一折在训练过程中出现梯度爆炸,数值为目标检测模型的 AP50 的均值和方差)
图 1-6 将不同模态图像下目标宫颈癌病变区域的检测结果和实际标注的癌灶区域进行了可视化。其中绿色框是由医师标注的真实病变区域,黄色框是目标检测模型的预测结果。分析图像模态信息可知,CT 图像既包含了人体正常结构的信息,也包含病灶的解剖信息,前者可能会干扰宫颈癌病变区域特征的识别和检测。因此,在单一 CT 模态下会有一些漏检。与 CT 模态的预测框相比,PET 模态下的预测框与标注框的 IOU 更高,或许是由于 PET 影像有更多能表现宫颈癌区域特征的信息。在 PET-CT 区域特征跨模态融合图像中检测效果最佳,因为 PET-CT 融合图像融合了两种模态的不同特征,从而大大提高检测的准确性。
图 1-6 跨模态融合图像的目标检测结果
1.4. 讨论
本项目旨在评估深度学习算法是否可以跨模态融合 FDG-PET 和 CT 图像,并在融合图像中实现宫颈癌病灶区域的自动检测。我们提出了一个基于跨模态融合图像的检测框架,包括区域特征匹配、图像融合和目标检测等步骤。融合 CT 和PET 图像可以最大程度地提取各个模态中包含的信息,因此 PET-CT 跨模态融合图像含有丰富的解剖和功能信息。目标检测实验证明,本项目提出的跨模态融合方法得到的融合图像显著提高了目标检测的准确性,相比单模态和其他融合方法得到的多模态图像,目标检测平均精确度分别提高了 6.06%和8.9%。
表 1-3 展示了基于不同的图像融合方法形成的多模态图像,不同检测模型在五折交叉验证下的结果。因在解剖和功能影像中均有异常表现的区域更可能是癌变,我们推测,图像信息对齐有利于对宫颈癌病灶的目标区域检测。图 1-6 展示了在不同目标检测模型和不同输入图像数据模态下目标检测效果的可视化图像。基于本项目提出的跨模态融合方法生成的图像进行的目标检测的检测结果更为准确,并消除了一些假阳性结果。根据医生的日常诊断习惯,生成了以红色和黄色为主色的融合图像。
利用 FDG-PET/CT 对宫颈癌进行及时、准确的分期能够影响患者的临床治疗决策,进而延缓疾病进展,并减少肿瘤治疗相关的整体财务负担[47] 。对 FDG-PET/CT 图像的解释在很大程度上依赖临床上获得的背景信息,并需要综合临床分析来确定是否发生癌症的浸润和转移[48]。在某些情况下,核医学科阅片医师可以迅速识别局部扩展和淋巴栓塞。而多数情况下,核医学科医师分析一位患者的FDG-PET/CT 影像学检查结果平均需要三个小时。比起占用医师昂贵且稀缺的时间,利用计算机进行此项工作既能节约成本,预计耗时又短,且可以全天候运行。本项目的目标是通过人工智能方法实现PET 和CT 图像的自动融合,并利用目标检测技术识别宫颈癌的浸润和转移,作为辅助工具加速 FDG-PET/CT 的阅片过程,从而使临床医生能够在最短的时间内按照 FIGO 指南对宫颈癌进行分期。
这项研究仍存在一些局限性。虽然本项目对基于 PET-CT 自适应融合图像的目标检测方法与其他最先进的基于深度学习的目标检测方法进行了比较,但将该方法拓展应用到其他病种的影像学分析的可行性仍需评估。此外,我们提出的跨模态融合框架在图像融合时并未考虑每种模态图像的权重分布。未来可以设计一种特殊的损失函数来调整 ROI 内每个像素的权重分布,以提高目标检测结果的准确性。
1.5. 结论
本项目提出了一种基于跨模态图像融合的多模态图像进行病变区域检测的深度学习框架,用于宫颈癌的检测。为了应对医学影像中单一模态图像在肿瘤检测方面的性能不足,提出了一种基于区域特征匹配的自适应跨模态图像融合策略,将融合后的多模态医学图像输入深度学习目标检测模型完成宫颈癌病变区域检测任务,并讨论了深度学习模型在每种模态图像输入间的性能差异。大量的实验证明,与使用单一模态的影像及基于简单融合方法或通道拼接融合方法的多模态影像相比,自适应融合后的多模态医学图像更有助于宫颈癌病变区域的检测。
本项目所提出的技术可实现 PET 和CT 图像的自动融合,并对宫颈癌病变区域进行检测,从而辅助医生的诊断过程,具备实际应用价值。后续将基于第一部分的目标检测模型基础,利用特征转换的方法,将图像数据转换为结构数据,将跨模态融合方法应用于分类问题。
2. 基于特征转换的跨模态数据融合的乳腺癌骨转移的诊断
2.1. 前言
骨骼是第三常见的恶性肿瘤转移部位,其发生率仅次于肺转移和肝转移,近 70%的骨转移瘤的原发部位为乳腺和前列腺[49],[50]。骨转移造成的骨相关事件非常多样,从完全无症状到严重疼痛、关节活动度降低、病理性骨折、脊髓压迫、骨髓衰竭和高钙血症。高钙血症又可导致便秘、尿量过多、口渴和疲劳,或因血钙急剧升高导致心律失常和急性肾功能衰竭[51]。骨转移是乳腺癌最常见的转移方式,也是患者预后的分水岭,其诊断后的中位生存期约为 40 个月[52],[53]。因此,及时发现骨转移病灶对于诊断、治疗方案的选择和乳腺癌患者的管理至关重要。目前,病灶穿刺活检是诊断骨转移的金标准,但鉴于穿刺活检有创、存在较高风险、且假阴性率高,临床常用影像学检查部分替代穿刺活检判断是否发生骨转移。
Batson的研究表明,乳腺的静脉回流不仅汇入腔静脉,还汇入自骨盆沿椎旁走行到硬膜外的椎静脉丛[54] 。通过椎静脉丛向骨骼的血液回流部分解释了乳腺癌易向中轴骨和肢带骨转移的原因。因潜在骨转移灶的位置分布较广,影像学筛查需要覆盖更大的区域,常要求全身显像。常用的骨转移影像诊断方法包括全身骨显像(whole-body bone scintigraphy, WBS)、计算机断层扫描(computed tomography, CT)、磁共振成像(magnetic resonance imaging, MRI)和正电子发射断层显像(positron emission tomography, PET)[55]。CT 可以清晰地显示骨破坏,硬化沉积,和转移瘤引起软组织肿胀;MRI 具有优异的骨和软组织对比分辨率;因 [18F] 氟化钠会特异性地被骨组织吸收、代谢, PET 可以定位全身各处骨代谢活跃的区域。然而,单一模态影像常不足以检测骨转移,且用传统方法综合单一患者的 CT、MRI、PET 数据筛查骨转移病灶需要对上千幅影像进行解读,这一极为耗时的过程可能影响临床医生对乳腺癌骨转移的诊断,造成误诊、漏诊。而骨转移的漏诊会误导一系列临床决策,导致灾难性后果。
作为一种客观评估体系,人工智能辅助骨转移自动诊断系统通过减少观察者间和观察者内的变异性,提高了诊断的一致性和可重复性,降低了假阴性率,在减轻临床医师的工作负担的同时,提高诊断的准确性。目前已经有很多在单一模态图像中(CT、MRI 或 PET)基于深度学习技术进行骨转移病变检测的工作:Noguchi 等人开发了一种基于深度学习的算法,实现了在所有 CT 扫描区域中对骨转移病灶的自动检测[56];Fan 等人用 AdaBoost 算法和 Chan-Vese 算法在 MRI 图像上对肺癌的脊柱转移病灶进行了自动检测和分割[57];Moreau等人比较了不同深度学习算法在 PET/CT 图像上分割正常骨组织和乳腺癌骨转移区域的性能[58]。但很少有使用跨模态数据融合的深度学习方法,判断是否存在骨转移灶的相关研究。
旨在减轻临床医生的工作负担,本章提出了基于特征转换的跨模态数据融合方法,用于分析 CT、MRI 和 PET 图像,以判断其中是否存在乳腺癌骨转移病灶。
基于特征转换的 CT、MRI 和 PET 跨模态图像数据融合,进行骨转移病变分类(即存在骨转移病灶和不存在骨转移病灶两类)项目包括三个研究任务:目标病变区域检测,特征构造及转换和分类任务。具体地,采用目标检测模型对不同模态的医学图像序列数据进行单独的骨转移瘤目标检测,再对这些检测结果进行特征提取。所提取的特征包括不同模态下检测结果置信度的区间占比、检测框的面积大小、检测框在图像中的空间位置分布等。这些特征被整理成结构化数据格式,完成了从非结构化影像数据到结构化数据特征的特征转换和融合过程。最后,将转换后的特征输入分类模型进行分类任务。实验比较了基于特征转换的跨模态数据融合方法在乳腺癌骨转移肿瘤分类任务的性能,与仅使用单模态数据执行分类任务的性能。同时,还将本项目提出的基于特征转换的融合策略与其他融合方法进行了对比。
2.2. 研究方法
2.2.1. 研究设计和工作流程
本项目旨在判断 CT、MRI、PET 图像序列中是否存在乳腺癌骨转移病灶。工作流程如图 2-1 所示:扫描设备对每位患者进行 CT、MRI 或 PET 图像序列的采集;使用目标检测模型分别在不同模态图像中对可疑乳腺癌骨转移灶进行目标检测;对检测结果进行特征提取、构造和融合,得到具有可解释性的结构化医疗数据;用分类模型对结构化数据进行分类任务,得出预测结果,从而判断乳腺癌骨转移是否发生。
图 2-1 工作流程
2.2.2. 骨转移目标区域检测
先由两位临床医师对多模态数据集图像中的骨转移病灶进行人工标注,并对患者进行分类(标签分为乳腺癌骨转移和非乳腺癌骨转移),并训练 YOLOv5 目标检测网络,以识别各个单一模态图像中的乳腺癌骨转移病灶。
2.2.3. 基于特征转换的跨模态数据融合
在本项目的数据集中,各种模态序列影像的扫描范围均涵盖了患者的全身。某患者的影像序列(不论是单模态图像还是多模态图像)中检测到含有骨转移病灶的切片图像数量越多,则意味着该患者发生乳腺癌骨转移的概率越大。根据这一基本推理,采用后融合方法,将一个影像序列中含有肿瘤切片图像的比例(百分比)作为结构化的数据特征,作为后续分类任务的依据。
具体操作如下:在每个模态的图像中完成骨转移区域的目标检测任务训练后,统计每个图像序列中检测到转移瘤目标区域的检测框数量。按照检测框的置信度划分为 8 个区间:10%~20%、20%~30%、30%~40%、40%~50%、50%~60%、60%~70%、70%~80%和大于 80%。在每个区间内,分别统计各模态图像序列中转移瘤检测框数量,再除以该序列中切片图像的总数,得到每个置信度区间内每种模态图像序列中含有转移瘤检测框的百分比。接着将三种模态图像提取出的统计特征拼接,组成结构化数据,实现跨模态数据融合。若患者缺失某种模态数据,相应的统计特征(百分比)将被置为零。特征转换后的结构化数据如图 2-2 所示,每种模态数据包括 8 个特征,即不同的置信区间,最后一列为标签值,其中“0”表示负例,“1”表示正例。
图 2-2 特征转换后的结构化数据
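上述按置信度区间统计、归一化并级联三种模态的特征转换过程,可用如下 Python 草图概括(区间边界按文中设定,函数与变量名为示例假设):

```python
import numpy as np

# 置信度区间:10%~20%、20%~30%、……、70%~80%、大于 80%,共 8 个区间
BIN_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

def modality_features(confidences, num_slices):
    """confidences:某一模态影像序列中全部骨转移检测框的置信度;
    num_slices:该序列的切片总数。返回 8 维归一化特征。"""
    counts, _ = np.histogram(confidences, bins=BIN_EDGES)
    return counts / max(num_slices, 1)  # 以切片总数归一化

def fuse_case(ct=None, mri=None, pet=None):
    """级联 CT、MRI、PET 三种模态的结构化特征;缺失模态以零填充。
    每个参数为 (置信度列表, 切片总数) 或 None。"""
    feats = []
    for modality in (ct, mri, pet):
        feats.append(np.zeros(8) if modality is None
                     else modality_features(*modality))
    return np.concatenate(feats)  # 24 维跨模态结构化特征

# 示例:某患者仅有 CT 与 PET 数据
x = fuse_case(ct=([0.35, 0.82, 0.91], 320), pet=([0.55], 180))
```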
2.2.4. 乳腺癌骨转移的分类模型
利用构建好的结构化医疗特征进行乳腺癌骨转移分类任务,融合跨模态图像数据特征判断是否发生乳腺癌骨转移。本项目采用的分类模型以模式识别基础模型为主,包括SVM[59]、AdaBoost[60]、RandomForest[61]、LightGBM[62]、GBDT[63]。SVM 是一种基于核函数的监督学习模型,用于解决分类问题,通过寻找最优超平面在特征空间中将样本分为不同类别,决策函数映射输入特征到输出标签,核函数将特征映射到新空间,损失函数度量决策函数性能,最大化超平面与样本间距离实现分类,可使用不同核函数处理高维特征。AdaBoost 是一种迭代算法,于1995 年由 Freund Y 等人提出,能够将多个弱分类器结合成一个强分类器,通过选择初始训练集、训练弱分类器、加权重新分配样本和重复训练直到训练完成所有弱学习器,最后通过加权平均或投票得出最终决策。由 Breiman L 等人于 2001 年提出的 RandomForest 是一种基于决策树的机器学习算法,可用于分类和回归任务,通过构建多个决策树并对它们的预测结果进行平均或投票来得出最终预测结果,训练过程中随机选择特征,以避免过拟合并减少计算量。机器学习模型 LightGBM是一种基于决策树的梯度提升机算法,由Ke G 等人在2017 年提出,适用于结构化数据的分类任务,具有高效、内存友好、支持并行处理和多 CPU 等特点,能快速处理大量特征,通过基于直方图的决策树算法减少训练时间和内存使用量,并通过损失函数的泰勒展开式近似表示残差来计算损失函数。由 Friedman J H 等人于2001 年提出的 GBDT 是一种迭代的决策树算法,通过构建多个决策树来拟合目标函数,每一步都在上一步的基础上构建新的决策树,以不断减小误差,流程包括选取子集、训练弱学习器、梯度下降法最小化误差,最终将弱学习器加入总体模型,重复以上步骤直至达到最优解。
2.2.4.1. 基于C3D 的跨模态数据融合分类模型
本项目采用C3D[64]分类模型作为对照模型,该模型是基于3D 卷积神经网络的深度学习方法,使用跨模态数据融合中的前融合策略。如图 2-1 所示,该融合策略从每个模态的图像序列中筛选出一部分,合并为一个完整的多模态图像序列,并在通道上进行级联,进行跨模态数据融合。融合后的数据作为 3D 卷积神经网络的输入,经过多个 3D 卷积层提取特征,最终在全连接层中执行分类任务,以判断影像中是否存在乳腺癌骨转移病灶。
2.3. 实验
2.3.1. 临床信息和影像数据集
本项目选取符合以下条件的患者开展研究:1)于 2000 年 01 月至 2020 年 12月期间在北京协和医院或国家癌症中心中国医学科学院肿瘤医院被诊断为原发性乳腺癌的患者 2)有全身 CT 或 PET 或 MRI 其中任一模态的全身影像数据;3)有电子病历记录。入组患者中有145名被确诊为乳腺癌骨转移,作为正例样本,有88名患者未发生乳腺癌骨转移,作为负例样本。每例样本数据包含一至三种不同模态的图像序列,其图像尺寸和切片图像数量各异。乳腺癌骨转移的多模态医学图像数据集对患者的全身进行采样,由于患者的 CT、MRI 或 PET 是不同时间、在不同设备上采集的,不同模态间的特征并非一一匹配。其中,CT 模态共有3051 张切片,MRI 模态共有 3543 张切片,而 PET 模态共有 1818 张切片。在入组进行分析之前,所有患者数据都已去标识化。本研究已通过北京协和医院伦理委员会批准。
该数据集可以用于执行目标检测任务和分类任务。
骨转移目标检测任务仅分析数据集中的正例样本,进行五折交叉验证:将 145例患者的数据按模态分为三组(CT组、MRI组、PET组),在每个组内对数据进行五等分,在每轮验证中选取一部分作为测试集。为获得测试性能的平均值,所有模型都需进行5 次训练和评估。
在利用结构化数据执行分类任务时,需要平衡正负样本数量,因此要扩充数据集。将具有多种模态的样本拆分为包含较少模态的样本,如将“CT+MRI+PET”类型拆分为“CT+MRI”或“CT+PET”等。如表 2-1 所示,扩充后共有 380 例样本数据,包括188 个正样本和 192 个负样本。下一步,合并五折交叉验证的目标检测结果,此后,进行特征构建和转换,从而获得适合跨模态数据融合和分类任务的结构化数据;对于负样本数据,也需要在合并骨转移目标检测模型的推理结果后,对数据进行结构化处理。
为证实在乳腺癌骨转移判断的分类任务中,基于特征转换的跨模态融合数据性能优于单一模态数据,需要进行多模态融合数据与单模态数据的对照实验。如表 2-1 所示,单模态数据包括仅有 CT、仅有 MRI 和仅有 PET 三种类型的数据集合,总计 212 个样本,而多模态数据涵盖了CT+MRI、CT+PET、MRI+PET 和CT+MRI+PET 四种类型,共计 168 个样本。分别对单模态数据和多模态数据进行独立划分,将每种模态数据进行五等份,进行五折交叉验证。在每轮验证中,选择一部分作为测试集。利用 SVM、AdaBoost、RandomForest、LightGBM、GBDT以及 C3D 模型进行实验,每个模型都需进行 5 轮训练和评估,以获得测试集上性能的平均值。
为适应 C3D 模型对图像统一尺寸的要求,针对不同患者切片数量、大小的差异,进行预处理。在每种模态图像序列中等间隔抽取 60 张图像切片,并进行缩放,使其组合为 180 张 128×128 像素的切片。对于缺失的模态数据,用 60 张零像素值的黑色图像切片进行填充。从 180 张切片中随机选取一个起始位置,连续抽取 120张切片作为模型的最终输入,确保输入尺寸为 128×128×120 像素。
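为满足 C3D 输入尺寸(128×128×120)所做的等间隔抽取、缩放、零填充与随机截取,可示意如下(Python 草图;series 假定为二维切片数组的列表,插值与随机策略为假设):

```python
import numpy as np
import cv2

def build_c3d_input(ct=None, mri=None, pet=None, rng=np.random.default_rng()):
    """每种模态等间隔抽取 60 张切片并缩放至 128x128,缺失模态以零填充;
    拼成 180 张后随机截取连续 120 张,得到 128x128x120 的输入(示意)。"""
    volumes = []
    for series in (ct, mri, pet):
        if series is None:  # 缺失模态:60 张零像素切片
            volumes.append(np.zeros((60, 128, 128), dtype=np.float32))
            continue
        idx = np.linspace(0, len(series) - 1, 60).astype(int)  # 等间隔抽取
        slices = [cv2.resize(series[i], (128, 128)) for i in idx]
        volumes.append(np.stack(slices).astype(np.float32))

    stacked = np.concatenate(volumes, axis=0)  # 共 180 张切片
    start = rng.integers(0, stacked.shape[0] - 120 + 1)  # 随机起始位置
    return stacked[start:start + 120]
```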
表 2-1 扩充后的分类数据集
2.3.2. 模型训练过程
在按上述步骤准备好数据集后,进行目标检测任务训练时,将每个模态的图像大小统一到1024×1024像素,然后使用多种数据增强方法,增加输入数据集对噪声的鲁棒性。
目标检测模型采用 YOLOv5,并使用 PyTorch 深度学习框架在 2 个 Nvidia Tesla V100-SXM2 32G GPU 上进行 70 个轮次的训练。初始学习率为 0.00001,使用 0.98 的动量和 0.01 的权值衰减通过 SGD 来优化各网络层的权重目标函数。在训练过程中,网络在验证集上达到最小的损失时,选择最佳参数。
进行分类任务时,采用 SVM、AdaBoost、RandomForest、LightGBM 以及 GBDT 等机器学习模型。因其超参数会对预测结果产生较大影响,在训练过程中,使用网格搜索策略为这些模型寻找最佳参数。网格搜索策略在一定范围的超参数空间内寻找最佳的超参数组合,通过枚举各种可能的组合并评估模型预测结果,最终选择表现最优的超参数组合。要搜索的超参数包括学习率、树的最大深度、叶子节点数量、随机抽样比例、权重的L1正则化项和权重的L2正则化项等。实验结果将基于最优超参数设定下的预测模型。模型训练的网络结构图如图 2-3 所示。
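文中网格搜索策略的一种示意实现如下(以 LightGBM 为例;参数网格为示例取值,X、y 分别为结构化特征矩阵与骨转移标签,均为假设):

```python
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

# 待搜索的超参数空间(示例取值,非论文原始设置)
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],          # 树的最大深度
    "num_leaves": [15, 31, 63],      # 叶子节点数量
    "subsample": [0.8, 1.0],         # 随机抽样比例
    "reg_alpha": [0.0, 0.1],         # 权重的 L1 正则化项
    "reg_lambda": [0.0, 0.1],        # 权重的 L2 正则化项
}

search = GridSearchCV(LGBMClassifier(), param_grid, scoring="roc_auc", cv=5)
# search.fit(X, y)                   # X: 24 维跨模态结构化特征;y: 是否骨转移
# print(search.best_params_, search.best_score_)
```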
用于对照的 C3D 模型使用 PyTorch 深度学习框架在 1 个 NVIDIA Tesla V100-SXM2 32GB GPU 上训练 100 个轮次,初始学习率为 0.00001,使用动量为 0.9,权值衰减为0.0005 的 SGD 梯度下降优化器对各网络层权重的目标函数进行优化。
图 2-3 网络结构图
2.3.3. 评价指标
本项目中的骨转移目标检测任务采用 AP50 作为评价指标,其介绍详见上一章节。
而在分类任务中,采用准确率(Accuracy, Acc)、敏感性(Sensitivity, Sen)、特异性(Specificity, Spe)和 AUC(Area Under Curve, AUC)作为评价指标,并采用ROC 曲线和PR 曲线对模型进行评估。准确率是指对于给定的测试集,分类模型正确分类的样本数占总样本数的比例,如公式 5 所示,其中真正例(True Positive, TP)表示预测为正例且标签值为正例,假正例(False Positive, FP)表示预测为正例但标签值为负例,假负例(False Negative, FN)和真负例(True Negative, TN)代表的概念以此类推。如公式 6 和公式 7 所示,敏感性和特异性的定义分别为:预测正确的正例占所有正例的比例,以及预测正确的负例占所有负例的比例。ROC曲线是一种评估二分类模型的方法,其横轴为假阳性率(False Positive Rate, FPR),纵轴为真阳性率(True Positive Rate, TPR),其中 TPR 的计算方式与上一章的召回率(Recall)相同,TPR 和 FPR 的计算方式详见公式 8 和公式 9。ROC 曲线展示了在不同阈值下,TPR 与 FPR 的变化关系。左上角点对应的假阳性率为 0、真阳性率为 1,表明模型将所有正例样本分类正确,且未将任何负例样本误判为正例,因此若 ROC 曲线靠近左上角,提示模型性能较好。AUC代表ROC曲线下的面积,即从(0,0)到(1,1)进行积分测量ROC曲线下二维区域的面积。AUC综合考虑所有可能的分类阈值,提供了一个全面的性能度量。AUC 值表示随机从正负样本中各抽取一个,分类器正确预测正例得分高于负例的概率。AUC 值越接近 1,说明模型性能越优秀。PR曲线的绘制方法详见上一章,PR曲线在不同分类阈值下展示了分类器在精度(Precision, P)和召回率(Recall, R)方面的整体表现。
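转录版中公式 5–9 缺失;按文中定义,可重建为(重建稿,仅供参考):

```latex
\begin{aligned}
\text{Acc} &= \frac{TP + TN}{TP + TN + FP + FN} && (5)\\
\text{Sen} &= \frac{TP}{TP + FN} && (6)\\
\text{Spe} &= \frac{TN}{TN + FP} && (7)\\
\mathrm{TPR} &= \frac{TP}{TP + FN} && (8)\\
\mathrm{FPR} &= \frac{FP}{FP + TN} && (9)
\end{aligned}
```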
2.3.4. 单模态骨转移灶检测模型及基于特征转换的跨模态分类模型的结果与分析
本项目对乳腺癌骨转移多模态医学图像数据集(包括 CT、MRI、PET)进行了单模态肿瘤检测实验和基于特征转换的跨模态病例分类实验。其中,单模态肿瘤检测实验是多模态肿瘤分类实验的前置步骤。
采用单阶段目标检测模型 YOLOv5,在五折交叉验证下比较了使用单模态 CT图像、PET图像、MRI图像作为输入数据集时,模型的目标检测性能。并在将目标检测结果进行特征转换后,采用不同的分类模型,包括后融合分类模型(LightGBM、GBDT、AdaBoost、RandomForest、SVM) 和 前 融 合 分 类 模 型(C3D),在五折交叉验证下比较了使用单模态数据和跨模态融合数据作为输入时,每个模型的分类性能。
表 2-2 展示了在不同单一模态数据上,五折交叉验证得到的骨转移病灶检测结果,评估指标为 AP50。实验结果表明,PET 模态的检测精度较高,而 CT 模态的检测精度最低。输入数据量较少、检测目标面积小、转移瘤的特征难以与正常骨组织区分是提高检测精度的难点。图 2-4 将不同单一模态图像下目标骨转移病变区域的检测结果和实际标注的癌灶区域进行了可视化。绿色框由医师标注,目标检测模型标注的预测框为黄色。
表 2-2 单模态骨转移灶检测五折交叉验证结果
图 2-4 可视化单模态目标检测结果
将 CT、MRI 和 PET 的数据组成单模态子数据集进行单模态分析,而将两种及两种以上的数据组成跨模态子数据集进行多模态分析。表 2-3 和表 2-4 展示了在上述两种子数据集中进行五折交叉验证的结果,对比了6 种不同模型每一折的准确率、AUC,及其平均值。这 6 类模型中的前 5 种模型使用后融合策略,而作为对照的C3D 模型采用前融合策略。对比表 2-3 和表 2-4 的实验结果可知,在任一模型(包括前融合模型)中,基于特征转换的跨模态融合数据在乳腺癌骨转移分类任务上相较于仅使用单模态数据的性能有所提高:平均准确率提高了7.9%;平均AUC 提高了8.5%。如表2-5 和2-6 所示,跨模态融合方法比单模态方法的平均敏感性提高了7.6%,平均特异性提高了 9.4%。
表 2-3 基于单模态子数据集进行分类任务的准确率和 AUC
表 2-4 基于跨模态子数据集进行分类任务的准确率和 AUC
表 2-5 单模态数据分类的敏感性和特异性
表 2-6 跨模态融合数据分类的敏感性和特异性
图 2-5 和图 2-6 分别展示了 6 个模型利用单模态数据进行分类实验和利用特征转换和融合后的跨模态数据进行分类实验的 PR 曲线。可以根据曲线形状和曲线下方面积来评估不同模型的性能表现,曲线下面积越大,提示模型的性能越优秀。综合观察单模态和跨模态分类实验的P-R曲线图,可以发现,基于跨模态数据的分类任务的P-R 曲线下面积大于基于单模态数据的分类任务的P-R 曲线下面积,提示跨模态数据作为输入时分类模型的表现更加出色。
比较基于单模态数据进行分类的模型的 P-R 曲线,可见 3D 卷积网络的训练方式相较于其他后融合模型的性能表现更优。然而,在基于跨模态数据进行分类的模型的 P-R 曲线中,基于特征转换的跨模态后融合策略相对于基于 3D 卷积的前融合方法具有更好的性能。
图 2-5 基于单模态数据不同分类模型的PR 曲线
图 2-6 基于跨模态数据不同分类模型的PR 曲线
图 2-7 和图 2-8 展示了 6 种分类模型在使用单模态数据进行分类实验和使用跨模态数据进行分类实验的情况下的ROC 曲线。通过对比观察六个模型的ROC 曲线的形状和面积来评估不同模型的性能。靠近左上角的 ROC 曲线表示假阳性率接近0,真阳性率接近 1,趋近于左上角的 ROC 曲线提示模型性能优越。对比图 2-7 和图 2-8 可知,使用基于特征转换的跨模态数据的骨转移病例分类模型的性能更为优越。
图 2-7 基于单模态数据不同分类模型的ROC 曲线
图 2-8 基于跨模态数据不同分类模型的ROC 曲线
本文提出的跨模态数据融合方法是基于特征转换的后融合策略,相较于前融合策略具有更好的性能。实验表明,无论采用前融合或后融合策略,基于跨模态融合数据的实验都表现出了显著的优势。相较于多模态数据,单一模态数据所捕获的特征较为单一(如仅有结构信息) ,可能由于缺乏关键和全面的特征信息导致模型性能不佳,而跨模态融合方法则能从不同模态中获取更多的有效特征,并将其融合,从而提高准确率。
2.4. 讨论
本项目旨在评估基于特征转换的跨模态数据融合方法是否可以跨模态融合 CT、MRI 和 PET 图像的有效特征,以对乳腺癌患者进行是否发生骨转移的评估。本项目提出了一个基于特征转换的跨模态融合图像数据框架,用于对骨转移病变进行分类,包括目标病变区域检测、特征构造及融合形成可解释的结构化数据以及跨模态融合数据分类步骤。融合 CT、MRI 和 PET 的转换特征数据能够充分利用各个模态中的信息,为分类任务提供更多的数据支持,并增加辅助判断的特征线索。基于特征转换的跨模态病例分类实验证明,本项目提出的跨模态融合数据显著提高了对影像序列进行二分类任务的性能,相较于单模态数据,平均准确率和 AUC分别提高了7.9%和 8.5%。
如表 2-2 所示,单模态目标检测模型在 PET 图像中的检测精度较高,而在 CT和 MRI 图像中的精度相对较低。图 2-4 展示了在 YOLOv5 目标检测模型中不同单一输入图像模态下乳腺癌骨转移检测效果的可视化图像。分析各种图像模态信息可知,CT 和 MRI 图像不仅包含病灶的解剖信息,还包含了人体正常组织的结构信息,而后者可能会干扰骨转移病变区域特征的识别和检测,导致单一CT或MRI模态下出现漏检现象。与之不同,PET图像展示的是组织代谢信息。骨转移病灶通常伴随着频繁的成骨和破骨活动,在 PET 影像中呈高代谢,而正常骨组织的代谢相对较缓慢,通常不会显示在图像中。因此,PET 对背景组织的干扰较 CT、MRI 更不敏感,有助于目标检测模型识别异常代谢区域。因早期无症状骨转移病灶通常体积较小,目标区域面积过小可能影响目标检测结果。在模型处理过程中,池化(pooling)操作可能导致特征或图像信息的损失,从而造成特征缺失。为了克服这一问题,后续研究可以关注提高模型在处理小目标区域时的性能。
表 2-4 和表 2-6 展示了在各种分类模型中,基于跨模态结构化数据在五折交叉验证下的分类性能。通过对比分析发现,相对于基于 C3D 的前融合分类模型,基于特征转换的后融合策略在性能方面有所提高。医学影像数据有数据量较少、维度高、结构复杂以及样本识别难度大等特点,这导致将特征提取任务交由模型完成,直接输入原始数据或经过简单预处理的数据,让模型自主进行特征提取并生成最终输出的这种端到端的前融合方法效果不尽如人意。由于患者在体型、身高等方面的个体差异,一个图像序列内的CT和MRI切片数量也有所不同。因此,需要对图像进行归一化处理,将其转换为统一的标准格式,如调整到相同尺寸、修正切割后图像中心的位置等。归一化操作旨在对数据进行统一格式化和压缩,但这可能会导致图像未对齐、图像与特征错位、数据压缩过度以及特征丢失等问题。因此,采用基于特征转换的后融合策略可能更合适本项目。前融合所采用的 C3D分类模型是一种在三维数据上进行分析的网络模型。三维数据具有尺度高、维度大以及信息稀疏等特点。尽管 C3D 网络训练过程中增加了一个维度的信息,但同时也提高了算法分析的复杂性,特别是在模型训练过程中,占用了大量显存等硬件资源,可能导致批归一化不理想和网络收敛不完全的问题。与 C3D 相比,本文提出的二阶段后融合方法实现了特征压缩,提取置信度这种可解释的特征,并去除了无关的稀疏特征。在有限的硬件资源和数据量的限制下,这种方法能更好地学习数据特征,起到了类似正则化(通过在损失函数中添加约束,规范模型在后续迭代过程中避免过度拟合)的效果。
乳腺癌骨转移病灶在代谢和结构方面都较正常骨组织显著不同。因此,通过融合 CT、MRI 和 PET 图像的特征信息,实现解剖和功能信息的跨模态融合,能更有效地完成分类任务,帮助诊断乳腺癌骨转移。然而,综合分析全身 CT、MRI 和PET图像信息需要医师投入大量时间,且存在较大的观察者间差异。一旦发生漏诊,会导致严重后果。利用计算机辅助医师判断乳腺癌是否发生骨转移不仅可以节省成本和时间,还能提供更加客观的评估标准。计算机辅助诊断工具可以综合多模态图像的结果进行特征转换和分析,预防漏诊的发生。因此,在未来的研究中,可以重点关注开发此类计算机辅助诊断系统,以提高乳腺癌骨转移诊断的准确性和效率。
这项研究仍存在一些局限性。从单个影像模态中提取的特征较为单一,仅有置信区间,可以在后续的训练中从临床角度出发加入更多可能影响骨转移判断的因素作为分类特征,如检测目标的面积,或增加中轴骨检出目标的权重。因本项研究中具有多模态影像数据的病例量不足,未来可以尝试除五折交叉验证之外其他的模型训练方法,以降低数据量对分类模型性能的影响。
2.5. 结论
本项目提出了一种基于特征转换的跨模态数据融合方法进行分类任务的深度学习框架,用于判断是否发生乳腺癌骨转移。首先独立对不同模态的医学图像数据进行肿瘤检测,根据目标检测结果进行特征构造,并将其组织成结构化数据的形式,完成从非结构化数据特征到结构化数据特征的转换与融合。最终,将结构化数据特征输入分类器,进行骨转移的分类任务,并对照 C3D 前融合模型,讨论了基于特征转换方法进行跨模态数据后融合的优势。大量的实验证明,使用基于特征转换的跨模态融合数据进行分类任务的性能优于基于单模态数据的分类性能;使用本项目提出的后融合策略执行分类任务较使用前融合策略的分类模型(C3D)的性能更好。
本项目所提出的技术可综合 CT、MRI 和 PET 模态数据的特征,对乳腺癌患者是否发生骨转移进行判断,辅助临床医师进行乳腺癌骨转移病灶的筛查,具备实际应用价值,也为在医学图像分析任务中更有效地应用跨模态融合方法,提供了关键的理论支持。
全文小结
目前,医学影像学的解读大量依赖临床医生个人的主观诊断经验,人工阅片易漏诊小目标,难以推广及表述,具有一定的局限性。与此相比,人工智能技术可以通过深度神经网络对大量积累的影像数据和诊断数据进行分析,学习并提取数据中对病理诊断有用的特征,从而在数据支持下做出更客观的判断。按成像方式不同,医学影像数据可分为多种模态,如B超、CT、MRI、PET。为了最大限度模拟临床医生结合不同模态影像检查结果形成诊断的过程,设计人工智能模型时,应将各种影像学模态的特征进行有效的融合,即本项目中应用的跨模态深度学习方法,充分利用不同模态图像的独特优势训练深度神经网络,从而提高模型性能。本项目以宫颈癌和乳腺癌骨转移为例,验证了跨模态深度学习方法在病变区域定位和辅助诊断方面的性能。
在第一部分中,我们回顾性纳入了220例有FDG-PET/CT数据的宫颈癌患者,共计 72,602 张切片图像。通过图像增强、边缘检测,实现 PET 和 CT 图像的 ROI自适应定位,再通过缩放、零值填充和剪切的方式,将两种模态图像的 ROI 对齐。经过加权和图像叠加,进行图像融合,将融合后的图像作为目标检测网络的输入层,进行宫颈癌病变区域检测。实验证明,相比使用单一 CT 图像、单一 PET 图像、PET-CT 简单融合图像、PET-CT 通道拼接融合图像作为网络输入,PET-CT 自适应区域特征融合图像显著提高了宫颈癌病变区域检测的准确性,目标检测的平均精确度(AP50)分别提高了 6.06%和 8.9%,且消除了一些假阳性结果,展现出可观的临床应用价值。
在第二部分中,我们回顾性纳入了 233 例乳腺癌患者,每例样本数据包含 CT、MRI、或 PET 一至三种模态的全身影像数据,共有 3051 张 CT 切片,3543 张 MRI切片,1818 张 PET 切片。首先训练 YOLOv5 目标检测网络,对每种单一模态图像中的骨转移病灶进行目标检测。统计每个影像序列中含有检出骨转移病灶的个数和置信度,将每个置信区间内含有目标检测框的百分比作为结构化医疗特征数据。采用级联方式融合三种模态的结构化特征,得到具有可解释性的结构化医疗数据,再用分类模型进行分类,预测是否发生骨转移。实验证明,相较于单模态数据,跨模态融合数据显著提高了乳腺癌骨转移诊断任务的性能,平均准确率和 AUC 分别提高了 7.9%和 8.5%,观察 ROC 曲线和 PR 曲线的形状和面积也有相同的实验结论。在不同的分类模型(SVM、AdaBoost、RandomForest、LightGBM、GBDT)中,使用基于特征转换的跨模态数据,相比单模态数据,对于骨转移病例的分类性能更为优越。而相较于基于 C3D 的前融合分类模型,基于特征转换的后融合策略在分类任务方面的性能更优。
综上所述,本文基于人工智能深度学习算法,针对不同模态医学图像的特征差异与互补性,进行多模态医学影像数据的跨模态融合,提高了模型的肿瘤检测和分类性能,检测模型和分类模型可以辅助影像学阅片过程,具有显著的临床实际应用价值。
参考文献
[1] 陈思源, 谭艾迪, 魏双剑, 盖珂珂. 基于区块链的医疗影像数据人工智能检测模型[J]. 网络安全与数据治理, 2022, 41(10): 21-25.
[2] Dong X, Wu D. A rare cause of peri-esophageal cystic lesion[J]. Gastroenterology, 2023, 164(2): 191-193.
[3] Arbyn M, Weiderpass E, Bruni L, et al. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis[J]. The Lancet Global Health, 2020, 8(2): e191-e203.
[4] Marth C, Landoni F, Mahner S, et al. Cervical cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up[J]. Annals of Oncology, 2017, 28: iv72-iv83.
[5] Gold M A. PET in Cervical Cancer—Implications for Staging, Treatment Planning, Assessment of Prognosis, and Prediction of Response[J]. Journal of the National Comprehensive Cancer Network, 2008, 6(1): 37-45.
[6] Gandy N, Arshad M A, Park W H E, et al. FDG-PET imaging in cervical cancer[C]. Seminars in nuclear medicine. WB Saunders, 2019, 49(6): 461-470.
[7] Grigsby P W. PET/CT imaging to guide cervical cancer therapy[J]. Future Oncology, 2009, 5(7): 953-958.
[8] Mirpour S, Mhlanga J C, Logeswaran P, et al. The role of PET/CT in the management of cervical cancer[J]. American Journal of Roentgenology, 2013, 201(2): W192-W205.
[9] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. nature, 2015, 521(7553): 436-444.
[10] Szeliski R. Computer vision: algorithms and applications[M]. Springer Science & Business Media, 2010.
[11] Ma B, Yin X, Wu D, et al. End-to-end learning for simultaneously generating decision map and multi-focus image fusion result[J]. Neurocomputing, 2022, 470: 204-216.
[12] Anwar S M, Majid M, Qayyum A, et al. Medical image analysis using convolutional neural networks: a review[J]. Journal of medical systems, 2018, 42: 1-13.
[13] Ma B, Ban X, Huang H, et al. Deep learning-based image segmentation for al-la alloy microscopic images[J]. Symmetry, 2018, 10(4): 107.
[14] Li Z, He J, Zhang X, et al. Toward high accuracy and visualization: An interpretable feature extraction method based on genetic programming and non-overlap degree[C] . 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2020: 299-304.
[15] Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors[J] . nature, 1986, 323(6088): 533-536.
[16] He K, Gkioxari G, Dollar P, et al. Mask R-CNN[C]. International Conference on Computer Vision. IEEE Computer Society, 2017: 2980-2988.
[17] Ma B, Wei X, Liu C, et al. Data augmentation in microscopic images for material data mining[J] . npj Computational Materials, 2020, 6(1): 125.
[18] Ma B, Zhu Y, Yin X, et al. Sesf-fuse: An unsupervised deep model for multi-focus image fusion[J] . Neural Computing and Applications, 2021, 33: 5793-5804.
[19] Kermany D S, Goldbaum M, Cai W, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning[J]. cell, 2018, 172(5): 1122-1131.e9.
[20] Hyun S H, Ahn M S, Koh Y W, et al. A machine-learning approach using PET-based radiomics to predict the histological subtypes of lung cancer[J]. Clinical nuclear medicine, 2019, 44(12): 956-960.
[21] Chilamkurthy S, Ghosh R, Tanamala S, et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study[J]. The Lancet, 2018, 392(10162): 2388-2396.
[22] Chen C, Xiao R, Zhang T, et al. Pathological lung segmentation in chest CT images based on improved random walker[J] . Computer methods and programs in biomedicine, 2021, 200: 105864.
[23] Chen C, Zhou K, Zha M, et al. An effective deep neural network for lung lesions segmentation from COVID-19 CT images[J] . IEEE Transactions on Industrial Informatics, 2021, 17(9): 6528-6538.
[24] Hill D L G, Batchelor P G, Holden M, et al. Medical image registration[J] . Physics in medicine & biology, 2001, 46(3): R1.
[25] Du J, Li W, Lu K, et al. An overview of multi-modal medical image fusion[J] .Neurocomputing, 2016, 215: 3-20.
[26] Watanabe H, Ariji Y, Fukuda M, et al. Deep learning object detection of maxillary cyst-like lesions on panoramic radiographs: preliminary study[J] . Oral radiology, 2021, 37: 487-493.
[27] Mattes D, Haynor D R, Vesselle H, et al. PET-CT image registration in the chest using free-form deformations[J] . IEEE transactions on medical imaging, 2003, 22(1): 120-128.
[28] Maqsood S, Javed U. Multi-modal medical image fusion based on two-scale image decomposition and sparse representation[J] . Biomedical Signal Processing and Control, 2020, 57: 101810.
[29] Elakkiya R, Subramaniyaswamy V, Vijayakumar V, et al. Cervical cancer diagnostics healthcare system using hybrid object detection adversarial networks[J] . IEEE Journal of Biomedical and Health Informatics, 2021, 26(4): 1464-1471.
[30] Al-Ameen Z, Sulong G, Gapar M D, et al. Reducing the Gaussian blur artifact from CT medical images by employing a combination of sharpening filters and iterative deblurring algorithms[J] . Journal of Theoretical and Applied Information Technology, 2012, 46(1): 31-36.
[31] Canny J. A computational approach to edge detection[J] . IEEE Transactions on pattern analysis and machine intelligence, 1986 (6): 679-698.
[32] Jocher G, Stoken A, Borovec J, et al. ultralytics/yolov5: v5.0-YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations[J]. Zenodo, 2021.
[33] Wang C Y, Liao H Y M, Wu Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020: 390-391.
[34] Liu S, Qi L, Qin H, et al. Path aggregation network for instance segmentation[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 8759-8768.
[35] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
[36] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(9): 1904-1916.
[37] Lee S I, Atri M. 2018 FIGO staging system for uterine cervical cancer: enter cross-sectional imaging[J]. Radiology, 2019, 292(1): 15-24.
[38] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
[39] Smith A R. Color gamut transform pairs[J]. ACM Siggraph Computer Graphics, 1978, 12(3): 12-19.
[40] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]. International conference on machine learning. PMLR, 2015: 448-456.
[41] Elfwing S, Uchibe E, Doya K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning[J]. Neural Networks, 2018, 107: 3-11.
[42] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C] .Proceedings of the IEEE international conference on computer vision. 2017: 2980-2988.
[43] Zhang S, Chi C, Yao Y, et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 9759-9768.
[44] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. Advances in neural information processing systems, 2015, 28.
[45] Cai Z, Vasconcelos N. Cascade r-cnn: Delving into high quality object detection[C] .Proceedings of the IEEE conference on computer vision and pattern recognition. 2018:6154-6162.
[46] Pang J, Chen K, Shi J, et al. Libra r-cnn: Towards balanced learning for object detection[C] . Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 821-830.
[47] Cohen P A, Jhingran A, Oaknin A, et al. Cervical cancer[J]. The Lancet, 2019, 393(10167): 169-182.
[48] Lee S I, Atri M. 2018 FIGO staging system for uterine cervical cancer: enter cross-sectional imaging[J] . Radiology, 2019, 292(1): 15-24.
[49] Coleman R E. Metastatic bone disease: clinical features, pathophysiology and treatment strategies[J]. Cancer treatment reviews, 2001, 27(3): 165-176.
[50] Cecchini M G, Wetterwald A, Van Der Pluijm G, et al. Molecular and biological mechanisms of bone metastasis[J]. EAU Update Series, 2005, 3(4): 214-226.
[51] Cuccurullo V, Lucio Cascini G, Tamburrini O, et al. Bone metastases radiopharmaceuticals: an overview[J]. Current radiopharmaceuticals, 2013, 6(1): 41-47.
[52] Emens L A, Davidson N E. The follow-up of breast cancer[C]. Seminars in oncology. WB Saunders, 2003, 30(3): 338-348.
[53] Chen W Z, Shen J F, Zhou Y, et al. Clinical characteristics and risk factors for developing bone metastases in patients with breast cancer[J]. Scientific reports, 2017, 7(1): 1-7.
[54] Batson O V. The function of the vertebral veins and their role in the spread of metastases[J]. Annals of surgery, 1940, 112(1): 138.
[55] O’Sullivan G J, Carty F L, Cronin C G. Imaging of bone metastasis: an update[J] .World journal of radiology, 2015, 7(8): 202.
[56] Noguchi S, Nishio M, Sakamoto R, et al. Deep learning–based algorithm improved radiologists' performance in bone metastases detection on CT[J]. European Radiology, 2022, 32(11): 7976-7987.
[57] Fan X, Zhang X, Zhang Z, et al. Deep Learning on MRI Images for Diagnosis of Lung Cancer Spinal Bone Metastasis[J]. Contrast Media & Molecular Imaging, 2021, 2021(1): 1-9.
[58] Moreau N, Rousseau C, Fourcade C, et al. Deep learning approaches for bone and bone lesion segmentation on 18FDG PET/CT imaging in the context of metastatic breast cancer[C] . 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2020: 1532-1535.
[59] Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers[C] . Proceedings of the fifth annual workshop on Computational learning theory, 1992: 144-152.
[60] Freund Y, Schapire R E. A decision-theoretic generalization of on-line learning and an application to boosting[J] . Journal of computer and system sciences, 1997, 55(1): 119-139.
[61] Breiman L. Random forests[J]. Machine learning, 2001, 45: 5-32.
[62] Ke G, Meng Q, Finley T, et al. Lightgbm: A highly efficient gradient boosting decision tree[J] . Advances in neural information processing systems, 2017, 30: 52.
[63] Chen T, Guestrin C. Xgboost: A scalable tree boosting system[C]. Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016: 785-794.
[64] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3d convolutional networks[C]. Proceedings of the IEEE international conference on computer vision, 2015: 4489-4497.
文献综述
3. 跨模态深度学习技术在临床影像中的应用
The Application of Deep Learning and Cross-modal Fusion Methods in Medical Imaging
Abstract
Deep learning technology is gaining widespread prominence across various fields in this era. In the realm of medical imaging, it has steadily assumed a pivotal role in tasks such as feature recognition, object detection, and image segmentation, since its inception. With the continuous evolution of imaging techniques, the individual patient often possesses an expanding wealth of multi-modal imaging data. It is evident that deep learning models utilizing cross-modal image fusion techniques will find diverse applications in a lot more clinical scenarios. In the future, deep learning will play a significant role in the medical sector, encompassing screening, diagnosis, treatment, and long-term disease management. To provide a reference for future research, this review aims to present a concise overview of the fundamental principles of deep learning, the nature of cross-modal fusion methods based on deep learning, with their wide-ranging applications, and a comprehensive survey of the present clinical uses of single-modal and cross-modal deep learning techniques in medical imaging, with a particular emphasis on bone metastasis imaging.
Keywords: deep learning, cross-modal, tumor imaging, bone metastasis
3.1 Preface
Nowadays, data are generated in massive quantities in the healthcare sector, from sources such as high-resolution medical imaging, biosensors with continuous output of physiologic metrics, genome sequencing, and electronic medical records. The limits on the analysis of such data by humans alone have clearly been exceeded, necessitating an increased reliance on machines. The use of artificial intelligence (AI), the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage. One field that has attracted particular attention for the application of AI is radiology, as the cost of medical scans is declining, and the use of imaging studies is now at least as common as physical examinations, even surpassing the latter in surgical emergencies out of humanism and accuracy concerns. AI can greatly aid clinicians in diagnosis, especially in the interpretation of radiographic images, the accuracy of which heavily relies on the clinical experience and scrutiny of their interpreters, thus freeing clinicians to devote more of their attention to providing bedside healthcare. The radiologic screening and staging of tumors rely heavily on radiologists’ subjective judgments. For some minuscule or ambivalent lesions, it is often difficult to arrive at a definitive diagnosis based solely on clinical experience. A case reported by Dong et al. proves the vulnerability of relying on error-prone human judgment[1 . AI methods mainly analyze medical images through image processing and deep learning techniques. As an assistant for clinicians, deep neural networks can be trained with large datasets of radiologic images or clinical information, automatically learning features key to the revelation of pathology or lesion localization. In addition to deep learning models based on images of a single modality, researchers have also proven the feasibility of integrating multi-modal medical imaging data in algorithms, with improved model robustness. A combination of feature representations from different imaging modalities can effectively improve the performance of tumor detection, classification, and segmentation. Artificially generating relatively scarce imaging data from more easily accessible radiographs by way of cross-modal image translation can not only aid diagnoses but also improve the performance of deep learning models.
This article will briefly review the relevant background of deep neural networks (DNNs), as well as the up-to-date development of cross-modal fusion and image translation methods. An overview of the current clinical applications of single-modal and cross-modal deep learning in tumor imaging, especially in bone metastasis imaging, is also provided.
3.2. Deep Neural Network (DNN)
Traditional machine learning methods face limitations in handling data in its raw form, as creating a suitable internal representation or feature vector requires a meticulous feature extractor designed manually to convert raw data, such as image pixels. Only after then a classifier, every detail of which was manually set and adjusted, could detect or classify patterns in the input, and spell out its outcome. Because of the varying qualities of images, lots of intricate image enhancement or filtering algorithms, such as adaptive Gaussian learning, histogram equalization, and de-noising, are designed alone for the purpose of pre-processing images to be ready for the feature extractor. Another downside of conventional machine learning is that manually coded algorithms would only, with images with finer details or better contrast, allow for automatic execution of the thought processes that can best mimic, not surpass, that of a clinician. All the key features to be extracted and used in encoding the classifier were essentially the same set of “inputs” a clinician would use to make his or her judgment.
In contrast, “representation learning” is a set of techniques utilized in deep learning, which enable machines to analyze raw data without manual feature engineering. This allows the system to automatically identify the relevant patterns or features needed for classification or detection. Pattern recognition using deep neural networks (DNNs) can help interpret medical scans, pathology slides, skin lesions, retinal images, electrocardiograms, endoscopy, faces, and vital signs. Deep learning algorithms employ multiple tiers of representation through the composition of nonlinear modules, which transform the input representation from one level to the next, beginning with the raw input and continuing to higher and more abstract levels. It is helpful to think of the entire network as, nonetheless, a “function”, that takes in a set of inputs and spills out an output, though with absurdly complicated parameters and transformations. Irrelevant variations or noise can be lost as stepping up towards the higher layers of representation that amplify only features important for discrimination. By this “layering” method, very complex functions can be learned. A key differentiating feature of deep learning compared with other subtypes of AI is its autodidactic quality, i.e. neither the number of layers nor features of each layer is designed by human engineers, unencumbered by either the essence or the flaws of the human brain.
3.2.1. Supervised learning
Supervised learning is the process of learning with training labels assigned by the supervisor, i.e. the training set of examples has its raw input data bundled with their desired outputs. When used to classify images into different categories, the machine is shown an image and produces an output in the form of a vector of scores, one for each category during training. In supervised learning, the machine receives immediate feedback on its performance when its output does not match the expected output. The aim is to assign the highest score to the desired category among all categories. This is achieved by calculating a cost function that measures the average error or distance between the output scores and the expected pattern of scores. The inputs of the cost function are the parameters of the machine. Much like the “update rule” used in the Mixed Complementarity Problem (MCP), the perceptron learning algorithm updates its internal adjustable parameters to minimize errors when it predicts the wrong category[2]. The adjustable parameters consist of weights and biases, where weights control the input-output function of the machine. The algorithm learns from its mistakes rather than successes, and the weights can number in the hundreds or millions. Weights are assigned to the connections between neurons from the input layer and one of the neurons in the next layer, in some sense representing the “strength” of a connection. The activation of a single notch of neurons in the next layer was computed by taking the weighted sum of all the activations of the first layer, e.g., greyscale values of the pixels. Just like biological neurons may have different thresholds that the graded sum of electric potentials at the cell body needed to reach for axonal propagation, the algorithm may not want its neuron to light up simply over a sum greater than 0. So, a “bias for inactivity” is introduced into the weighted sum formula. For example, if a neuron is designed to be active only if the weighted sum exceeds 10, then a 10 is subtracted from the formula before the transformation that follows. Weights represent the pixel pattern (weight assigned to each pixel can be visualized as a pixel pattern) that the algorithm identifies, while biases provide a threshold indicating the required level of weighted sum for a neuron to become meaningfully active. When the goal is to have the value of activations of the next layer between 0 and 1 and were the mapping to be smooth and linear, then the weighted sum can be pumped into a sigmoid function, i.e. 1/(1 + exp(−w)), where w is the weighted sum. A sigmoid transformation compresses the continuum of real numbers, mapping them onto the interval between 0 and 1, effectively pushing negative inputs towards zero, and positive inputs towards 1, and the output steadily increases around the input 0. Say the mapping is from p neurons from the one layer to the q neurons of the next layer, there would be p×q number of weights and q biases. These are all adjustable parameters that can be manipulated to modify the behavior of this network. In a deep learning system, changing the parameters may reflect a shift in the location, size, or shape of the representations to find “better” features to travel through the layers to get to the desired output.
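The weighted sum, bias (threshold), and sigmoid described above can be condensed into a short sketch (illustrative only; shapes and variable names are assumptions, not the thesis's code):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(activations, weights, biases):
    """Map p activations of one layer to q activations of the next:
    weights has shape (q, p), i.e. p*q weights, and biases has shape (q,).
    Subtracting the bias acts as the 'threshold for inactivity'."""
    return sigmoid(weights @ activations - biases)

# Example: 784 greyscale pixel values feeding 16 neurons in the next layer
pixels = np.random.rand(784)
w = np.random.randn(16, 784)
b = np.full(16, 10.0)   # e.g. activate only if the weighted sum exceeds 10
print(layer_forward(pixels, w, b).shape)  # (16,)
```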
At present though, preferred mappings in DNN are neither smooth nor linear. By applying a non-linear function to the input, the categories become separable by the last output layer in a linear way, resulting in a definitive category output, unlike the previously mentioned range of numerical values that would require arbitrary cut-off points to finalize the categorization. A sigmoid transformation was once popular in the era of Multilayer Perceptron, during which a machine was simply an executor of commands, and the feature detected by each layer was designed and programmed by human engineers, such that the final output, as a continuous variable, would be interpretable[3]. The rectified linear unit (ReLU) is currently the most widely used non-linear function, which introduces non-linearity into the network by setting all negative values to zero. This is in contrast to the smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), used in previous decades.
ReLU has proven to be a faster learner in deep networks compared to these other non-linear functions and allows for the training of deep supervised networks without the need for unsupervised pre-training[4].
This would not work in DNN, as hidden layers would not be picking up edges and patterns based on our expectations. How the machine gets to the correct output is still an enigma, and its intelligence still awaits revelation.
The essential of learning by neural networks is to minimize the cost function. It is important for this cost function to have a nice and smooth output so that the local minimum can be obtained by taking little steps downhill, rather than being either on or off in a binary way the way biological neurons are. To adjust the weight and bias values of the parameter vector in a high-dimensional space, the learning algorithm computes a gradient vector that specifies how much the error, or cost, would increase or decrease if each parameter were slightly modified. In mathematical terms, this is similar to taking derivatives of a function with respect to a variable to observe the trend of the function between the two infinitesimally close values of that variable. In multivariate calculus, the gradient of a function indicates the path of the steepest incline, pointing towards the direction in the input space where one should move to minimize the output of this cost function with the utmost speed, and the length of the vector indicates exactly how steep the steepest ascent is. The weight vector is modified by shifting it in the opposite direction of the gradient vector, and the size of the adjustments is proportional to the slope of the gradient vector. When the slope of the gradient vector approaches the minimum, the step size decreases to prevent overshooting. This is the so-called “gradient descent” that converges on some local minimum. Minimizing the cost function can guarantee better performance across all training samples. Viewed from a different perspective, the gradient vector of the cost function encodes the relative importance of weights and biases, which changes to which weights matter the most to minimize the cost. The magnitude of each component represents
how sensitive the cost is to each weight and bias.
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). It involves randomly selecting a few input vectors as mini-batches, computing the corresponding outputs, errors, and the gradient descent step. The weights were adjusted accordingly. This process is repeated for many small subsets of examples from the training set until the average cost function stops decreasing. Each small subset of examples gives a noisy estimate of the average gradient over all examples, and thus the “stochasticity”. Despite its simplicity, SGD often achieves good results with far less computation time than more complex optimization techniques[5].
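A minimal sketch of the mini-batch SGD loop described above (the `grad` function and the data layout are assumptions, not tied to any particular model):

```python
import numpy as np

def sgd(params, grad, data, lr=0.01, batch_size=32, epochs=10,
        rng=np.random.default_rng(0)):
    """Plain mini-batch SGD: sample a small subset of examples, compute the
    gradient of the cost on that batch (a noisy estimate of the full gradient),
    and step in the opposite direction."""
    n = len(data)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            batch = data[rng.choice(n, size=batch_size, replace=False)]
            params = params - lr * grad(params, batch)
    return params
```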
3.2.2. Backpropagation
Recursively adjusting the weights in proportion to the activations of the second-to-last layer, and vice versa, or altering the biases to decrease the cost for a single training sample, constitutes a single round of digital learning. In a nutshell, the backpropagation procedure is an algorithm for computing the gradient efficiently. Calculating the gradient of a cost function with respect to the weights in a stacked multilayer module is a practical application of the chain rule of derivatives. A key insight is that the derivative of the cost function with respect to the input of a module can be obtained by working backward through the layers, from the higher to the lower ones. The process of backpropagation thus entails computing gradients through all layers, from the uppermost layer where the network generates predictions down to the lowermost layer where the external input is introduced. Once these gradients have been calculated, it is straightforward to determine the gradients with respect to the weights and biases of each module. The average of the desired changes, obtained by traversing the backpropagation route for the different training samples, is the optimal adjustment the parameters can make so that the model performs better on the training set.
It was commonly thought that simple gradient descent would get trapped in suboptimal local minima, weight configurations for which no small change would reduce the cost function, and that finding the global minimum would be an intractable task. Recent theoretical and empirical results strongly suggest that the landscape of the cost function is actually filled with a huge number of saddle points where the gradient is zero, indicating that the optimization challenge is more complex than originally thought, but that most of these points have similar cost function values[6]. In other words, the local minima that do exist are of nearly equal depth, so it matters little which one the algorithm settles in.
3.2.3. Convolutional neural networks (CNN)
Convolutional neural networks (CNNs) are easier to train and generalize better than other feedforward networks with fully connected layers. They are specifically designed to process data represented as multiple arrays, such as a grayscale image, which is a single 2D array of pixel intensities.
The four key ideas behind CNNs are inspired by the properties of natural signals and by visual neuroscience: local connections, shared weights, pooling, and the use of many layers. The convolutional and pooling layers are directly inspired by the concepts of simple cells and complex cells, respectively, in the visual cortex, and the overall architecture is reminiscent of the LGN-V1-V2-V4-IT hierarchy in the ventral pathway of the visual cortex[7][8]. Local groups of values in array data often exhibit high correlation and form characteristic local motifs that can be readily identified. This kind of pattern recognition is what makes CNNs especially useful for analyzing images.
3.2.3.1. Convolution
The main function of the convolutional layer is to identify and extract local combinations of features from the preceding layer (Fig. 1). The actual matching is accomplished by filtering in the convolutional layer. A filter can be thought of as a small matrix of representative feature values (real numbers) whose number of rows and columns, e.g., n × n pixels, is set in advance. The filter is lined up with an image patch, each image pixel is multiplied by the corresponding filter entry, and the results are added up (a dot product) and, in the scheme described here, divided by the total number of pixels in the filter to give a single feature value. The feature value indicates how well the feature is represented at that position. Sliding the filter across the image by a fixed stride and repeating the same procedure for every n × n block of pixels in the input image yields a feature map, a "map" of where the filter's feature occurs. All units in a feature map share the same filter bank, so the local characteristics of images and other signals are treated identically regardless of their location: a pattern that appears in one part of the image can appear in any other part as well. Hence the approach of using units with identical weights to identify corresponding patterns across different sections of the array. In a convolutional layer, filtering can be performed with a set of such filters, creating a stack of filtered images; each feature map in a layer employs its own filter bank. From a mathematical standpoint, the filtering operation performed by a feature map is a discrete convolution, hence the name.
Fig. 1: Example of a filter (kernel) convolution. Note that the new pixel value shown in the figure has not been divided by the number of entries in the filter.
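A short NumPy sketch of the filtering procedure described above may be helpful; the option of dividing by the number of filter entries follows the description in this section and is not required in standard convolution layers, and the example image and kernel are arbitrary.

```python
import numpy as np

def feature_map(image, kernel, stride=1, normalize=True):
    """Slide a small filter over a 2D image and record how well it matches at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            value = np.sum(patch * kernel)     # element-wise multiply, then add up
            if normalize:
                value /= kernel.size           # divide by the number of entries in the filter
            fmap[i, j] = value
    return fmap

# A small diagonal-edge filter applied to a toy 5x5 image.
image = np.eye(5)
kernel = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(feature_map(image, kernel, stride=1))
```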
3.2.3.2. Pooling
The aim of the pooling layer is to reduce the size of a feature map by merging similar features into a single one, through the following steps: (1) choose an appropriate window size, usually 2×2 or 3×3 pixels; (2) pick a corresponding stride (how many pixels the window moves at each step), usually 2 pixels; (3) walk the window by its stride across the filtered images; (4) take the maximum value in each window as the pooling result, forming a "pooled map". Coarsening the positional information of each feature across all the feature maps fed into this layer makes motif detection more robust. Pooling lets the network disregard exactly where in each window the maximum value occurs, making it less sensitive to small translational or rotational shifts: an image that strongly matches the filter will still be picked up.
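The pooling steps above can be summarized in a few lines of NumPy; the window size and stride follow the typical 2×2, stride-2 choice mentioned above, and the input feature map is arbitrary.

```python
import numpy as np

def max_pool(fmap, window=2, stride=2):
    """Shrink a feature map by keeping only the maximum value inside each window."""
    out_h = (fmap.shape[0] - window) // stride + 1
    out_w = (fmap.shape[1] - window) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = fmap[i * stride:i * stride + window, j * stride:j * stride + window]
            pooled[i, j] = patch.max()   # exact position inside the window is discarded
    return pooled

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))   # 2x2 "pooled map" of window maxima
```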
3.2.3.3. Normalization
To keep the values from blowing up, the output of a layer is then passed through a non-linearity such as the rectified linear unit (ReLU), which sets all negative values to zero. This nonlinear transformation step is referred to here as "normalization".
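For completeness, the ReLU operation amounts to a single element-wise maximum; the feature values below are arbitrary.

```python
import numpy as np

# ReLU: negative feature values are set to zero, positive values pass through unchanged.
feature_values = np.array([-2.0, -0.5, 0.0, 0.7, 3.1])
print(np.maximum(feature_values, 0.0))   # [0.  0.  0.  0.7 3.1]
```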
The CNN architecture stacks multiple stages of convolution, non-linearity (normalization), and pooling on top of one another, followed by a final fully connected layer (Fig. 2). The filter banks of the convolutional layers and the voting weights of the fully connected layer are all learned through the backpropagation algorithm. In the fully connected layer, also known as the dense layer because every neuron is connected to every neuron in the adjacent layers, the list of feature values becomes a list of votes; multiplied by the weights that map this layer to the output layer, these votes give the final answer. It is worth noting that this list of votes looks much like a list of feature values. Indeed, the output of this layer can serve as intermediate categories feeding into the input of the next layer, continuing the cycle rather than producing the final votes.
Fig. 2: Example of a CNN with various types of layers. The convolutional layer does not decrease the size, i.e., the number of pixels, of its input; rather, it encodes the features of the input. The pooling layer does decrease the size of its input, by an amount that depends on the size of the pooling window and the stride.
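A compact PyTorch-style sketch of this stacking pattern is given below; the layer sizes, input resolution, and number of classes are arbitrary assumptions, not the configuration used elsewhere in this thesis.

```python
import torch
import torch.nn as nn

# A small CNN following the stacking pattern described above:
# convolution -> non-linearity -> pooling, repeated, then a fully connected ("dense") layer.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # filter bank 1
            nn.ReLU(),                                   # non-linearity ("normalization" above)
            nn.MaxPool2d(2),                             # pooling halves the spatial size
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # filter bank 2
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # "votes" for each class

    def forward(self, x):                  # x: (batch, 1, 64, 64) grayscale slices
        x = self.features(x)
        x = torch.flatten(x, 1)            # list of feature values
        return self.classifier(x)          # weighted votes give the final answer

logits = TinyCNN()(torch.randn(4, 1, 64, 64))
print(logits.shape)   # torch.Size([4, 2])
```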
3.3. Cross-modal fusion
Cross-modal fusion refers to the process of integrating data from different modalities. PET/CT is a classic example. CT is an imaging modality that provides high-resolution cross-sectional images with excellent clarity and density resolution. PET, on the other hand, is a nuclear imaging technique that generates images showing the spatial distribution of positron-emitting radiopharmaceuticals within the body; although less precise in structural detail, PET images are well suited to displaying metabolic activity. PET/CT fuses CT with PET and therefore carries information on both anatomical detail and the metabolic picture. Because each information stream has its own characteristics, single-modal data often do not contain all the effective features needed to produce accurate results, whether for data analysis or for prediction tasks. Cross-modal deep learning models combine data from two or more modalities, learning different feature representations from each and facilitating communication and transformation among the information streams in order to accomplish specific downstream tasks. This type of deep learning can improve the accuracy of predictions and enhance the robustness of models.
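As a minimal sketch of how two aligned modalities can be presented to a single network, channel-wise concatenation of CT and PET slices can be expressed in one line of PyTorch; the tensor shapes below are hypothetical and assume the slices have already been registered to a common grid.

```python
import torch

# Assuming a CT slice and a PET slice that have already been spatially aligned and
# resampled to the same grid, channel-wise concatenation is one simple way to present
# both modalities to a single network.
ct = torch.randn(1, 1, 128, 128)     # (batch, channel, H, W), anatomical detail
pet = torch.randn(1, 1, 128, 128)    # same grid, metabolic activity

fused = torch.cat([ct, pet], dim=1)  # shape (1, 2, 128, 128): one input, two information streams
print(fused.shape)
```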
3.3.1. Cross-modal fusion methods
Cross-modal fusion methods can be categorized into three types: early fusion, late fusion, and hybrid fusion. In early fusion, unimodal features are combined into a single representation before the feature extraction or modeling process[9]. In late fusion, feature extraction or modeling is first performed separately on the unimodal features, and the outputs are then integrated to learn concepts and obtain the final prediction[10]. Hybrid fusion combines the two, performing fusion at both the feature level and the output layer[11]. The structural difference is sketched below.
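The following toy sketch contrasts early and late fusion using arbitrary feature vectors and linear models; the feature dimensions and the averaging rule for late fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy contrast between early and late fusion for two modalities A and B.
feat_a = torch.randn(8, 32)   # 8 samples, 32 features from modality A
feat_b = torch.randn(8, 16)   # 8 samples, 16 features from modality B

# Early fusion: combine unimodal features into one representation, then model it.
early_model = nn.Linear(32 + 16, 2)
early_logits = early_model(torch.cat([feat_a, feat_b], dim=1))

# Late fusion: model each modality separately, then integrate the outputs
# (here simply by averaging the per-modality logits).
model_a, model_b = nn.Linear(32, 2), nn.Linear(16, 2)
late_logits = (model_a(feat_a) + model_b(feat_b)) / 2

print(early_logits.shape, late_logits.shape)   # both torch.Size([8, 2])
```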
There are various early fusion methods, including operating directly on elements at the same position in different modalities; in medical imaging, for example, different imaging modalities can be fused into an integrated image. Nefian et al. proposed a cross-modal early fusion method that used both a factorial and a coupled hidden Markov model for audio-visual integration in speech recognition[12]. Early fusion was done by multiplying the corresponding elements of the visual features, which capture mouth deformation over consecutive frames, with the vector representation of the audio observation frequencies learned by long short-term memory neural networks; dimensionality reduction was then applied to the observation vectors obtained by concatenating the audio and visual features. In general, early fusion methods are simple in structure and have low computational complexity. However, the resulting features are often high-dimensional, which can impose a significant computational burden on the subsequent model if dimensionality reduction is not performed.
As an example of late fusion, Simonyan et al. proposed in 2014 an architecture with separate spatial and temporal recognition streams for video: the spatial stream recognizes actions from still video frames, while the temporal stream recognizes actions from motion in the form of dense optical flow[13]. The learned feature outputs are combined by late fusion, either by averaging or with a linear support vector machine (SVM). Since fusion significantly improves on either stream alone, the result demonstrates the complementary nature of the spatial and temporal streams and shows that cross-modal fusion indeed preserves more useful information for the algorithm. Late fusion does not explicitly consider inter-modality correlation at the feature level, which may result in a lack of interaction among modalities at that level. Consequently, the feature representations obtained after cross-modal fusion may not be rich enough, potentially limiting the effectiveness of the fusion approach.
There is no single fusion method that is optimal for every problem; the choice should be made case by case.
3.3.2. Cross-modal image translation
Cross-modal image translation has gradually matured in the field of computer vision. Given sufficient training data, deep learning models are capable of learning discriminative features from images of different modalities, and the process of image-to-image translation can be viewed as transforming one potential representation of a scene to another.
In 2017, Isola et al. released Pix2Pix, a framework effective at various image translation tasks such as synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images[14]. A conventional CNN learns to minimize a loss function that must be hand-designed, which takes considerable manual effort. Pix2Pix adopts conditional generative adversarial networks (GANs) that, besides learning the mapping from input to output images, also learn the loss function used to train that mapping, providing a generic solution to pixel-to-pixel prediction. These networks address a whole family of problems that previously required very different loss functions. Conditional GANs differ from other formulations in that they treat output pixels as mutually dependent and thus learn a structured loss, which penalizes the joint configuration of the output. Pix2Pix performs well on many image translation tasks, but its ability to generate high-resolution images is limited. In 2018, Wang et al. improved upon Pix2Pix with a new image translation framework for synthesizing high-resolution, photo-realistic images from semantic label maps using conditional GANs. Compared to Pix2Pix, this method offers two main improvements: image translation at 2048 × 1024 resolution and semantic editing of images. To generate high-resolution images, it uses a coarse-to-fine generator, composed of a global generator for coarse, low-resolution synthesis and a local enhancer for fine, high-resolution refinement, together with a multi-scale discriminator architecture and a robust adversarial learning objective.
Additionally, it adds a low-dimensional feature channel to the input, which allows diverse result images to be generated from the same input label map. In 2017, Zhu et al. proposed the BicycleGAN model, which builds on Pix2Pix and combines the conditional variational autoencoder GAN approach with the conditional latent regressor GAN approach[15]. BicycleGAN is a technique for multimodal image translation that not only maps the input, together with a latent code, to the output, but also concurrently learns an encoder that maps the output back to the latent space. The bijection between the output and the latent space prevents many distinct latent codes from producing the same output, i.e., a non-injective mapping. BicycleGAN allows the generator to model a distribution of possible high-dimensional outputs, producing diverse and realistic results while remaining faithful to the input.
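As a condensed sketch of a Pix2Pix-style training objective, the generator below is updated with an adversarial term plus a weighted L1 term; the toy generator, discriminator, tensor sizes, and hyperparameters are placeholders for illustration, not the published architectures.

```python
import torch
import torch.nn as nn

# Sketch of a Pix2Pix-style generator update: the generator is trained with an adversarial
# term (fool the conditional discriminator) plus an L1 term that keeps the translated image
# close to the paired target. G and D are stand-ins, not the published networks.
G = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))    # toy "generator"
D = nn.Sequential(nn.Conv2d(2, 1, 3, padding=1))    # toy conditional "discriminator"
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
lambda_l1 = 100.0                                   # weight of the L1 term

x = torch.randn(4, 1, 64, 64)    # input modality (e.g., a label map or source image)
y = torch.randn(4, 1, 64, 64)    # paired target modality

fake = G(x)
d_out = D(torch.cat([x, fake], dim=1))              # discriminator sees (input, output) pairs
g_loss = bce(d_out, torch.ones_like(d_out)) + lambda_l1 * l1(fake, y)

opt_g.zero_grad()
g_loss.backward()
opt_g.step()                                        # only the generator is updated here
```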
The main challenge of cross-modal image translation is to accurately transform specific objects between modalities. Most cross-modal image translation methods require paired data as input, and because paired data are scarce, the translated images are often suboptimal or suffer from mode collapse, where the output represents only a limited number of real samples. How to achieve high-quality cross-modal image translation with a small amount of paired data is therefore a valuable direction for research.
3.4. The application of cross-modal deep learning
AI is increasingly being studied in metastatic skeletal oncology imaging, and deep learning has been assessed for tasks such as detection, classification, segmentation, and prognosis. Zhao et al. developed a deep neural network-based model to detect bone metastasis on whole-body bone scans (WBS), irrespective of the primary malignancy[16]. Compared to experienced nuclear medicine physicians, the deep learning model not only saved 99.88% of the time for the same workload but also had better diagnostic performance, with improved accuracy and sensitivity. To overcome the time-consuming effort required for precise labeling of large datasets, Han et al. proposed a 2D CNN classifier-tandem architecture named GLUE, which integrates whole-body and local patches of WBS from prostate cancer patients[17]. 2D CNN modeling is well suited to planar nuclear medicine scans, provided a massive amount of training data is available; the GLUE model had significantly higher AUCs than a whole-body-based 2D CNN model when the labeled dataset used for training was limited. Noguchi et al. developed a deep learning-based algorithm, with high lesion-based sensitivity and low false positives, to detect bone metastases in CT scans[18]. An observer study evaluating its clinical efficacy showed improved radiologists' performance when aided by the model, with higher sensitivity in both lesion-based and case-based analyses and shorter interpretation time. Fan et al. used AdaBoost and Chan-Vese algorithms to detect and segment sites of spinal metastasis of lung cancer on MRI images[19]; the Chan-Vese algorithm performed best, with segmentation accuracy, expressed as DSC and Jaccard coefficient scores, of 0.8591 and 0.8002, respectively. Liu et al. built a deep learning model based on 3D U-Net for the automatic segmentation of pelvic bone and prostate cancer metastases on DWI and T1-weighted MRI images[20]. The model worked best on patients with few metastases, supporting the use of CNNs as an aid to M-staging in clinical practice. Lin et al. developed multiple deep classifiers to automatically detect metastases in 251 thoracic SPECT bone images[21]; their performance was excellent, with an AUC of 0.98. Moreau et al. compared different deep learning approaches for segmenting bones and metastatic lesions in PET/CT images of breast cancer patients[22]. The results indicated that the U-NetBL-based approach for bone segmentation outperformed traditional methods, with a mean DSC of 0.94 ± 0.03, whereas the traditional methods struggled to distinguish metabolically active organs from bone.
Compared to the aforementioned deep learning examples, the more avant-garde cross-modal image fusion and translation techniques have not been widely investigated in bone metastasis imaging. Xu et al. adopted two different convolutional neural networks for lesion segmentation and detection and combined the spatial feature representations extracted from the two modalities of PET and CT[23]. Their cross-modal method accomplished three-dimensional detection of multiple myeloma, outperforming traditional machine learning methods. Wang et al. showed that texture features extracted from multiparametric prostate MRI before intervention, when combined with clinicopathological risk factors such as free PSA level, Gleason score, and age, could effectively predict bone metastasis in patients with prostate cancer[24]. The outcome of this study can be seen as a proof of concept for the significance of cross-modal data.
Even though cross-modal investigations of bone metastases are so far limited, there is ample evidence of the utility of cross-modal fusion in oncological imaging, and these applications and lines of thought can be readily extrapolated to osseous metastasis imaging. Cross-modal fusion can be applied to tasks such as tumor detection, segmentation, and classification to improve the performance of deep learning models, and cross-modal image translation can be used for data augmentation to facilitate various downstream tasks.
Cross-modal fusion methods are often employed to enrich models with cross-modal image features and thereby improve tumor detection performance. In deep learning-based cross-modal tumor detection algorithms, convolutional neural networks are used to capture the relationships between adjacent pixels and extract effective features from the images. In 2021, Huang et al. proposed a ResNet-based framework, AW3M, that jointly uses ultrasonography of four different modalities to diagnose breast cancer[25]. By combining the cross-modal data, AW3M, built on a multi-stream CNN equipped with a self-supervised consistency loss, extracts both modality-specific and modality-invariant features, with improved diagnostic performance.
As for tumor segmentation, many researchers rely on either the four MRI image modalities or the two PET/CT modalities, which together encompass anatomical and metabolic information, to perform cross-modal fusion and improve segmentation performance. For instance, Ma et al. explored CNN-based cross-modal approaches for automated nasopharyngeal carcinoma segmentation[26]. Their multi-modality CNN uses CT and MRI to jointly learn a cross-modal similarity metric and fuses complementary features at the output layer to segment paired CT-MR images, demonstrating excellent performance. The study additionally combines the features extracted from each modality's single-modality CNN with those of the multi-modality CNN to create a combined CNN that capitalizes on the unique characteristics of each modality, further improving segmentation performance. In 2021, Fu et al. introduced a deep learning-based framework for multimodal PET-CT segmentation that leverages PET's high tumor sensitivity[27]. Their approach uses a multimodal spatial attention module to highlight tumor regions and suppress normal regions with physiologically high uptake in the PET input. The spatial attention maps generated by the PET-based module are then used to guide a U-Net backbone in segmenting areas with higher tumor likelihood at different stages from the CT images. Results showed that their method surpasses the state-of-the-art lung tumor segmentation approach by 7.6% in Dice similarity coefficient.
As the diagnostic process often requires the integration of multi-modal information, such as chief complaints, physical examinations, medical histories, laboratory tests, and radiology, cross-modal fusion methods are also commonly used in disease classification tasks. Cross-modal fusion synthesizes data from different modalities to enrich the effective feature representations, enabling deep learning models to extract useful information from each modality to aid diagnosis. Zhang et al. proposed a technique for prostate cancer diagnosis using a multi-modal combination of B-mode ultrasonography and sonoelastography[28]. Quantitative features such as intensity statistics, regional percentile features, and texture features were extracted from both modalities, and an integrated deep network was proposed to learn and fuse these multimodal ultrasound imaging features; the final disease classification step was completed by a support vector machine.
Due to the relative scarcity of medical images, cross-modal image translation is often used to synthesize part of the training set as a data augmentation method, yielding better-performing deep learning models from small sample sizes. Since integrated data from different modalities often yield better performance in deep learning models, the multi-modal image data generated by cross-modal image translation methods can be used directly for tumor detection. Jiang et al. proposed a two-step approach for semi-supervised tumor segmentation using MRI and CT images[29]. The first step is tumor-aware unsupervised cross-modal adaptation, using a target-specific loss to preserve tumors on MRIs synthesized from CT images. The second step trains a U-Net model with the synthesized MRIs and a limited number of original MRIs using semi-supervised learning. Combining labeled pre-treatment MRI scans with the synthesized MRIs boosted tumor segmentation accuracy to 80%, compared with 74% when training with synthesized MRIs alone. The approach demonstrates the effectiveness of tumor-aware adversarial cross-modal translation for accurate cancer segmentation from limited imaging data.
In general, there is a wealth of research supporting the application of deep learning to bone metastasis imaging, but specific applications of cross-modal fusion methods are still lacking. Yet clinical evaluation of bone metastasis often draws on multi-modal data, such as a chief complaint of lower back pain, a past medical history of pathological fractures, a positive genetic test for mutations indicating a higher risk of bone metastasis, or elevated blood calcium and alkaline phosphatase concentrations in laboratory reports. Evaluating osseous lesions with multi-modal data can therefore improve diagnostic specificity and reduce false positive rates in the diagnostic and treatment process. The application of cross-modal deep learning methods to bone metastasis imaging and diagnosis is worth further exploration.
3.5. Conclusion
The above review covers the definition and basic principles of deep learning and of cross-modal image generation and fusion methods, briefly describes some common cross-modal deep learning algorithms, and summarizes current research on the application of deep learning models in medical imaging, especially bone metastasis imaging. Compared to traditional deep learning models fed with single-modality input, multi-modal methods are more recent, with a limited body of relevant research. Given the increasing prevalence of cancer screening and the significant surge in patient-specific clinical data, including radiographs and laboratory tests, it is reasonable to anticipate an unparalleled demand for advanced, intelligent cross-modal deep learning methods in the future. Nevertheless, the use of AI in medical imaging analysis faces various challenges and limitations, including the need for extensive and diverse datasets for training and validation, the potential for bias and overfitting, and the inherent black-box nature of deep learning algorithms[30]. Even though the demand for large training sets reiterates a merit of cross-modal deep learning, which enables the automatic generation of sample images through cross-modal image translation, the size of the training set still has a profound impact on the performance of algorithms. In parallel, the demand for "explainability" has led to the notion of "interpretable machine learning", which uses heat maps and other metrics to track the focus of deep neural networks[31]. Overall, much remains to be investigated regarding the application of cross-modal deep learning in medical imaging.
In summary, this project is founded on the application of cross-modal deep learning techniques, aiming to offer practical solutions to challenges encountered in the clinical setting.
参考文献
[1] Dong X, Wu D. A rare cause of peri-esophageal cystic lesion[J]. Gastroenterology, 2023, 164(2): 191-193.
[2] Aswathi R R, Jency J, Ramakrishnan B, et al. Classification Based Neural Network Perceptron Modelling with Continuous and Sequential data[J]. Microprocessors and Microsystems, 2022: 104601.
[3] Gardner M W, Dorling S R. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences[J]. Atmospheric Environment, 1998, 32(14-15): 2627-2636.
[4] Glorot X, Bordes A, Bengio Y. Deep Sparse Rectifier Neural Networks[J]. Journal of Machine Learning Research, 2011, 15: 315-323.
[5] Bottou L, Bousquet O. The tradeoffs of large scale learning[J]. Advances in Neural Information Processing Systems, 2007, 20: 1-8.
[6] Dauphin Y, et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization[J]. Advances in Neural Information Processing Systems, 2014, 27: 2933-2941.
[7] Hubel D H, Wiesel T N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex[J]. Journal of Physiology, 1962, 160(1): 106-154.
[8] Cadieu C F, Hong H, Yamins D, et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition[J]. PLoS Computational Biology, 2014, 10(12): e1003963.
[9] Nefian A V, Liang L, Pi X, et al. Dynamic Bayesian networks for audio-visual speech recognition[J]. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): 1-15.
[10] Snoek C G M, Worring M, Smeulders A W M. Early versus late fusion in semantic video analysis[C]. Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005: 399-402.
[11] Wu Z, Cai L, Meng H. Multi-level fusion of audio and visual features for speaker identification[C]. International Conference on Biometrics, Springer, Berlin, Heidelberg, 2005: 493-499.
[12] Nefian A V, Liang L, Pi X, et al. Dynamic Bayesian networks for audio-visual speech recognition[J]. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): 1-15.
[13] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[J]. Advances in Neural Information Processing Systems, 2014, 27: 568-576.
[14] Isola P, Zhu J Y, Zhou T, et al. Image-to-image translation with conditional adversarial networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1125-1134.
[15] Zhu J Y, Zhang R, Pathak D, et al. Toward multimodal image-to-image translation[J]. Advances in Neural Information Processing Systems, 2017, 30: 465-476.
[16] Zhao Z, Pi Y, Jiang L, Xiang Y, Wei J, Yang P, et al. Deep neural network based artificial intelligence assisted diagnosis of bone scintigraphy for cancer bone metastasis[J]. Scientific Reports, 2020, 10(1): 17046.
[17] Han S, Oh J S, Lee J J. Diagnostic performance of deep learning models for detecting bone metastasis on whole-body bone scan in prostate cancer[J]. European Journal of Nuclear Medicine and Molecular Imaging, 2021, 49(2): 1-11.
[18] Noguchi S, Nishio M, Sakamoto R, Yakami M, Fujimoto K, Emoto Y, et al. Deep learning–based algorithm improved radiologists' performance in bone metastases detection on CT[J]. European Radiology, 2022, 32(11): 7976-7987.
[19] Fan X, Zhang X, Zhang Z, Jiang Y. Deep learning on MRI images for diagnosis of lung cancer spinal bone metastasis[J]. Contrast Media & Molecular Imaging, 2021, 2021(1): 1-9.
[20] Liu X, Han C, Cui Y, Xie T, Zhang X, Wang X. Detection and segmentation of pelvic bones metastases in MRI images for patients with prostate cancer based on deep learning[J]. Frontiers in Oncology, 2021, 11: 773299.
[21] Lin Q, Li T, Cao C, Cao Y, Man Z, Wang H. Deep learning based automated diagnosis of bone metastases with SPECT thoracic bone images[J]. Scientific Reports, 2021, 11(1): 4223.
[22] Moreau N, Rousseau C, Fourcade C, Santini G, Ferrer L, Lacombe M, et al. Deep learning approaches for bone and bone lesion segmentation on 18FDG PET/CT imaging in the context of metastatic breast cancer[C]. 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society, 2020: 1532-1535.
[23] Xu L, Tetteh G, Lipkova J, et al. Automated whole-body bone lesion detection for multiple myeloma on 68Ga-pentixafor PET/CT imaging using deep learning methods[J]. Contrast Media & Molecular Imaging, 2018, 2018: 2391925.
[24] Wang Y, Yu B, Zhong F, Guo Q, Li K, Hou Y, et al. MRI-based texture analysis of the primary tumor for pre-treatment prediction of bone metastases in prostate cancer[J]. Magnetic Resonance Imaging, 2019, 60: 76-84.
[25] Huang R, Lin Z, Dou H, et al. AW3M: An auto-weighting and recovery framework for breast cancer diagnosis using multi-modal ultrasound[J]. Medical Image Analysis, 2021, 72: 102137.
[26] Ma Z, Zhou S, Wu X, et al. Nasopharyngeal carcinoma segmentation based on enhanced convolutional neural networks using multi-modal metric learning[J]. Physics in Medicine & Biology, 2019, 64(2): 025005.
[27] Fu X, Bi L, Kumar A, et al. Multimodal spatial attention module for targeting multimodal PET-CT lung tumor segmentation[J]. IEEE Journal of Biomedical and Health Informatics, 2021, 25(9): 3507-3516.
[28] Zhang Q, Xiong J, Cai Y, et al. Multimodal feature learning and fusion on B-mode ultrasonography and sonoelastography using point-wise gated deep networks for prostate cancer diagnosis[J]. Biomedical Engineering/Biomedizinische Technik, 2020, 65(1): 87-98.
[29] Jiang J, Hu Y C, Tyagi N, et al. Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation[C]. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2018: 777-785.
[30] Castelvecchi D. Can we open the black box of AI?[J]. Nature, 2016, 538(7623): 20.
[31] Kuang C. Can A.I. Be Taught to Explain Itself?[J]. The New York Times, 2017, 21.
致谢
First, I thank my supervisor, Academician 邱贵兴, whose guidance and kindness I will always keep in my heart. I thank my senior colleague 吴南, whose recognition and support I can never repay. I thank Dr. 吴东; it has been my great fortune to have you accompany me through a stretch of wind and snow. I also thank all the research collaborators on this project and the senior and junior members of the research group; your help allowed this study to proceed smoothly.
I thank all the teachers I have met at 协和. Though an unpromising student, I hope that in the future I, like you, can live up to the white coat.
Finally, I thank my family. Wherever I go, this family is my fortress.
In this world of long roads and hurried horses, may you and I simply be safe and happy.