A review of deep learning-based medical image segmentation methods
Shi Jun1, Wang Tiantong1, Zhu Ziqi2, Zhao Minfan1, Wang Bingxun1, An Hong1 (1. School of Computer Science and Technology, University of Science and Technology of China; 2. School of Artificial Intelligence and Data Science, University of Science and Technology of China) Abstract
Medical image segmentation is an important component of clinical medical image analysis. Its goal is to accurately identify and segment regions of interest in medical images, such as anatomical structures or lesions, thereby providing objective and quantitative evidence for clinical applications including disease diagnosis, treatment planning, and postoperative evaluation. In recent years, with the continuous growth of available annotated data, deep learning-based medical image segmentation methods have developed rapidly, demonstrating accuracy and robustness far beyond those of traditional segmentation methods, and have become the mainstream technology in this field. To further improve segmentation accuracy, a large body of research has focused on improving the structure of segmentation models, producing a series of structurally distinct segmentation approaches. In general, existing deep learning-based medical image segmentation methods can be divided by model structure into three categories: those based on convolutional neural networks (CNNs), those based on vision Transformers, and those based on vision Mamba. Among them, CNN-based methods, represented by U-Net, were the first to be widely applied to various medical image segmentation tasks. These methods generally take the convolution operation as their core and can effectively extract local image features. In contrast, vision Transformer-based methods are better at capturing global information and long-range dependencies, and can therefore better handle complex contextual information. Vision Mamba-based methods, as an emerging architecture, show great application potential owing to their global receptive field and linear computational complexity. To gain a deeper understanding of the development trajectory, strengths, and weaknesses of deep learning-based medical image segmentation methods, this paper systematically reviews existing methods. First, we briefly review the structural evolution of the three mainstream categories of segmentation methods and analyze the structural characteristics, advantages, and limitations of each. We then discuss in depth the major challenges and opportunities facing the field of medical image segmentation from multiple perspectives, including algorithm structure, learning methods, and task paradigms. Finally, we analyze and discuss future development directions and application prospects of deep learning-based medical image segmentation methods.
Keywords: deep learning; medical image segmentation; convolutional neural network; vision transformer; vision mamba
A review of deep learning-based medical image segmentation methods
Shi Jun1, Wang Tiantong1, Zhu Ziqi2, Zhao Minfan1, Wang Bingxun1, An Hong1 (1. School of Computer Science and Technology, University of Science and Technology of China; 2. School of Artificial Intelligence and Data Science, University of Science and Technology of China) Abstract
Medical image segmentation is a crucial component of clinical medical image analysis, aimed at accurately identifying and delineating anatomical structures or regions of interest, such as lesions, within medical images. This provides objective and quantitative support for decision-making in disease diagnosis, treatment planning, and postoperative evaluation. In recent years, the rapid growth of available annotated data has facilitated the swift development of deep learning-based medical image segmentation methods, which demonstrate superior accuracy and robustness compared to traditional segmentation techniques, thereby becoming the mainstream technology in the field. To further enhance segmentation accuracy, extensive research has focused on improving the structural designs of segmentation models, resulting in a variety of distinct segmentation approaches. Current deep learning-based medical image segmentation methods can be classified into three main structural categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Vision Mamba.
As a representative neural network architecture, CNNs effectively capture spatial features in images through their unique local receptive fields and weight-sharing mechanisms, making them particularly suitable for image analysis and processing tasks. Since 2015, CNN-based methods, exemplified by U-Net, have dominated the field of medical image segmentation, consistently achieving state-of-the-art performance across various downstream segmentation tasks. To further improve segmentation accuracy, many studies have focused on modifying and innovating the U-Net structure, leading to a series of derived segmentation methods. However, the inherent limitations of convolutional operators, particularly their local receptive fields, restrict these methods' ability to capture global contextual dependencies, especially when handling complex medical images and fine-grained segmentation targets. While techniques such as attention mechanisms and specialized convolutions have somewhat alleviated this issue and enhanced the model's focus on global information, their effectiveness remains limited.
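The local receptive field and weight sharing described above can be made concrete with a minimal sketch. This is an illustrative toy example, not code from any of the surveyed methods: a single shared kernel slides over an image, so each output value depends only on a small neighbourhood, and the same weights are reused at every position.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: one shared kernel slides over the image,
    so every output pixel is computed from a local neighbourhood
    (local receptive field) using the same weights (weight sharing)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                     # 3x3 mean filter
features = conv2d(image, kernel)                   # shape (3, 3)
```

Deep CNNs such as U-Net stack many such (learned) kernels with downsampling, which enlarges the effective receptive field but, as noted above, still limits global context.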
Since 2020, researchers have begun to introduce Transformer architectures, originally developed in the natural language processing (NLP) domain, into computer vision tasks, including medical image segmentation. Vision Transformers utilize self-attention mechanisms to effectively model global dependencies, significantly improving the quality of semantic feature extraction and facilitating the segmentation of complex medical images. Transformer-based methods for medical image segmentation mainly include hybrid approaches that combine Transformers with CNNs and pure Transformer methods, each with distinct advantages and disadvantages. Hybrid approaches leverage CNNs' strengths in local feature extraction alongside Transformers' capabilities in modeling global context, thereby enhancing segmentation accuracy while maintaining computational efficiency. However, these methods remain dependent on CNN structures, which may limit their performance in complex scenarios. In contrast, pure Transformer methods excel in capturing long-range dependencies and multiscale features, significantly improving segmentation accuracy and generalization. Nevertheless, pure Transformer architectures typically require substantial computational resources and high-quality training data, which is challenging given the difficulty of obtaining large-scale annotated datasets in the medical field.
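The self-attention mechanism that gives vision Transformers their global receptive field can be sketched in a few lines. This is a simplified single-head illustration (weights and token count are arbitrary, not from any specific model): every token attends to every other token, which is what makes the cost quadratic in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention: every token attends to every
    other token, giving a global receptive field at O(n^2) cost in the
    sequence length n."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n, n) pairwise affinities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 16, 8                                  # n patch tokens of dimension d
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)     # attn has shape (n, n)
```

The (n, n) attention matrix is exactly the quadratic cost discussed in the next paragraph: for a high-resolution 3D medical volume, n is the number of patches, and n² quickly becomes prohibitive.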
Despite the notable advantages of Transformer structures in capturing long-range dependencies and global contextual information, their computational complexity grows quadratically with the length of the input sequence, limiting their applicability in resource-constrained environments. To overcome this challenge, researchers are developing new methods capable of modeling global dependencies with linear time complexity. Mamba introduces a novel selective state-space model that employs a selection mechanism, hardware-aware algorithms, and a simpler architecture to significantly reduce computational complexity while maintaining efficient long-sequence modeling performance. Consequently, since 2024, numerous studies have applied the Mamba structure to medical image segmentation tasks, achieving promising results and positioning it as a potential replacement for Transformer structures. Hybrid methods that combine Mamba with CNNs can effectively enhance segmentation accuracy and robustness by integrating CNNs' feature-extraction capabilities with Mamba's handling of long-range dependencies, although the integration may increase computational complexity. Pure Mamba methods, in turn, are better suited to segmentation tasks requiring global contextual information, but they still face limitations in capturing spatial image features and may demand greater computational resources during training.
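The linear-time, input-dependent recurrence behind selective state-space models can be sketched as follows. This is a toy scalar-gated scan written in the spirit of Mamba, not the actual Mamba architecture (the real model uses discretized continuous-time parameters and a hardware-aware parallel scan); all weight names here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_ssm(x, w_a, w_b, w_c):
    """Toy selective state-space scan: the recurrence
        h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t
    runs once over the sequence, so the cost is O(n) in the length n,
    yet h_t carries information from all earlier positions (global
    context). a_t, b_t, c_t depend on the input x_t -- the 'selection'
    mechanism that lets the model keep or forget content per token."""
    n, d = x.shape
    h = np.zeros(d)
    y = np.empty_like(x)
    for t in range(n):                  # single linear-time scan
        a_t = sigmoid(x[t] @ w_a)       # input-dependent decay gate in (0, 1)
        b_t = x[t] @ w_b                # input-dependent input gate
        c_t = x[t] @ w_c                # input-dependent output gate
        h = a_t * h + b_t * x[t]        # update hidden state
        y[t] = c_t * h
    return y

rng = np.random.default_rng(0)
n, d = 32, 4
x = rng.standard_normal((n, d))
w_a, w_b, w_c = (rng.standard_normal((d, d)) for _ in range(3))
y = selective_ssm(x, w_a, w_b, w_c)     # shape (n, d), computed in O(n)
```

Compared with the (n, n) attention matrix of a Transformer, the scan above never materializes pairwise interactions, which is why vision Mamba methods scale linearly with the number of image patches.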
In summary, this paper systematically reviews and analyzes the development trajectory, advantages, and limitations of deep learning-based medical image segmentation methods from a structural perspective for the first time. First, we categorize all surveyed methods into three structural classes. We then provide a brief overview of the structural evolution of different segmentation methods, analyzing their structural characteristics, strengths, and weaknesses. Subsequently, we delve into the major challenges and opportunities currently facing the field of medical image segmentation from multiple perspectives, including algorithm structure, learning methods, and task paradigms. Finally, we conduct an in-depth analysis and discussion of future development directions and application prospects.
Keywords
deep learning; medical image segmentation; convolutional neural network; vision transformer; vision mamba