Semantic fusion method for arteriovenous classification in fundus images
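Gao Yingqi1, Guo Song1, Li Ning1, Wang Kai1,2, Kang Hong1,3, Li Tao1 (1. College of Computer Science, Nankai University, Tianjin 300350; 2. Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, Tianjin 300071; 3. Beijing Shanggong Medical Technology and Development Co. Ltd., Beijing 100176)
Abstract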
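Objective Arteriovenous classification in fundus images is a basic step in the risk assessment of many systemic diseases. Methods based on traditional machine learning are complicated to operate and often depend on the results of vessel extraction, so they cannot classify arteries and veins end to end; advances in deep semantic segmentation make end-to-end classification possible. Drawing on the strong feature extraction capability of deep learning and aiming to improve classification accuracy, this paper proposes SFU-Net (semantic fusion based U-Net), an arteriovenous segmentation model based on semantic fusion. Method Given the particular nature of the arteriovenous classification task, this paper adopts a multilabel learning strategy to reduce the difficulty of optimization. Because arterial and venous features are highly similar, DenseNet-121 is used as the feature extractor of SFU-Net, and a semantic fusion module is proposed to make the features more discriminative. The module comprises two operations, feature fusion and a channel attention mechanism: 1) semantic features at different scales are fused to obtain more discriminative features; 2) the features more important to the target task are selected automatically, which improves performance. To address the imbalanced distribution of vessel and background pixels in fundus images, focal loss is used as the objective function, which handles class imbalance while concentrating optimization on hard samples. Result Experiments show that the proposed method outperforms the vast majority of existing methods in arteriovenous classification. On the DRIVE (digital retinal images for vessel extraction) dataset, its sensitivity is only 0.61% below that of the current best method, while its specificity, accuracy, and balanced accuracy are higher by 4.25%, 2.68%, and 1.82%, respectively; on the WIDE dataset, its accuracy is 6.18% higher than that of the current best method. Conclusion The semantic fusion module makes effective use of multiscale features and selects features automatically, which improves performance. The proposed SFU-Net performs very well on arteriovenous classification, surpassing the vast majority of existing methods.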
Keywords
Arteriovenous classification method in fundus images based on semantic fusion
Gao Yingqi1, Guo Song1, Li Ning1, Wang Kai1,2, Kang Hong1,3, Li Tao1 (1. College of Computer Science, Nankai University, Tianjin 300350, China; 2. Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, Tianjin 300071, China; 3. Beijing Shanggong Medical Technology and Development Co. Ltd., Beijing 100176, China)
Abstract
Objective Arteriovenous (A/V) classification in fundus images is a fundamental step in the risk assessment of many systemic diseases. A/V classification methods based on traditional machine learning require complicated feature engineering, often rely on the results of blood vessel extraction, and cannot achieve end-to-end A/V classification. The development of deep semantic segmentation, which is now common in fundus image analysis, makes end-to-end A/V classification possible. Drawing on the powerful feature extraction capability of deep learning, this paper proposes a segmentation model, semantic fusion based U-Net (SFU-Net), to improve the accuracy of A/V classification. Method First, the arteries and veins in a fundus image are both blood vessels and are highly similar in structure. Existing deep learning-based A/V classification methods often treat the task as a multiclass classification problem. This paper instead adopts a multilabel learning strategy, which reduces the difficulty of optimization and handles locations where arteries and veins cross. The lower layers of the network extract the features common to the two structures, while the upper layers learn two binary classifiers that extract the arteries and the veins independently. Second, because arteries and veins have highly similar color and structural features, this paper improves the U-Net architecture in two respects. 1) The simple feature extractor of the original U-Net is replaced by DenseNet-121. The original U-Net encoder consists of 10 convolutional layers and four max pooling layers, so its feature extraction capability is limited.
By contrast, DenseNet-121 has many more convolutional layers, and its dense connections increase feature reuse and make the propagation of features and gradients through the network more efficient, which gives it a strong feature extraction ability. This paper also reduces the four downsampling operations of U-Net to three, so that the input image is downsampled by a factor of eight, to avoid losing detail. 2) A semantic fusion module is proposed. The module comprises two operations: feature fusion and a channel-wise attention mechanism. Low-level features have high resolution and carry much location and detail information, but they contain little semantic information and considerable noise. High-level features carry strong semantic information, but their resolution is very low and they retain little detail. Features from different layers are therefore first fused to make them more discriminative. The channel-wise attention mechanism then selects among the fused features. Because a convolution filter captures only local information, global average pooling is applied to each channel of the input features to capture global context. Each element of the resulting vector is a condensed representation of its corresponding channel. Two nonlinear transformations are then applied to the vector to model the correlation between channels while keeping the number of parameters and the amount of computation low. The vector is restored to its original dimension and normalized to the range 0 to 1 by a sigmoid gate. Each element of the resulting vector is treated as the importance of the corresponding channel, and each channel of the input feature is weighted by it through multiplication.
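The channel-wise attention steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the weight matrices `w_reduce` and `w_restore` are hypothetical stand-ins for learned parameters, the use of ReLU after the first transformation is an assumption borrowed from common squeeze-and-excitation designs, and bias terms are omitted.

```python
import math

def channel_attention(features, w_reduce, w_restore):
    """Reweight each channel of `features` by an importance score in (0, 1).

    features:  list of C channels, each an H x W list of lists of floats
    w_reduce:  (C // r) x C weights of the first transformation (hypothetical)
    w_restore: C x (C // r) weights of the second transformation (hypothetical)
    """
    # Global average pooling: one scalar summary per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]

    # First transformation: reduce the dimension, then ReLU (assumed nonlinearity).
    hidden = [max(0.0, sum(w * z for w, z in zip(row, pooled))) for row in w_reduce]

    # Second transformation: restore the dimension, then apply the sigmoid gate.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_restore]

    # Multiply every value in a channel by that channel's gate.
    weighted = [[[g * v for v in row] for row in ch] for ch, g in zip(features, gates)]
    return weighted, gates

# Toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2.
features = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
            [[0.0, 0.0], [0.0, 0.0]]]   # channel 1, mean 0.0
w_reduce = [[0.5, 0.5]]                 # 1 x 2, illustrative values
w_restore = [[1.0], [-1.0]]             # 2 x 1, illustrative values
weighted, gates = channel_attention(features, w_reduce, w_restore)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first channel is emphasized and the second is suppressed.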
Through the channel attention mechanism, the network learns during training to focus automatically on the feature channels that are important to the task and to suppress the unimportant ones, which improves model performance. Third, to address the imbalanced distribution of blood vessel and background pixels in fundus images, this paper uses focal loss as the loss function, which handles class imbalance while focusing on hard samples. Focal loss introduces two parameters, α and γ, into the cross-entropy loss. Parameter α balances positive and negative samples. Parameter γ controls how strongly the loss of easy samples is reduced, which widens the gap between the loss values of hard and easy samples. The values of both parameters are determined through cross-validation. The overall objective is the sum of the focal losses for arteries and veins, so that the two are optimized jointly during training. Result The proposed method is validated on two public datasets, digital retinal images for vessel extraction (DRIVE) and WIDE, and its performance is evaluated from two perspectives: segmentation and classification. Experimental results show that the proposed method outperforms most existing methods. On the DRIVE dataset, it achieves areas under the curve of 0.968 6 and 0.973 6 for segmenting arteries and veins, respectively, and the sensitivity, specificity, accuracy, and balanced accuracy of A/V classification are 88.39%, 94.25%, 91.68%, and 91.32%, respectively. Compared with the state-of-the-art method, the sensitivity of the proposed method is lower by only 0.61%, while specificity, accuracy, and balanced accuracy show absolute improvements of 4.25%, 2.68%, and 1.82%, respectively.
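The four classification metrics reported above can be illustrated with a short sketch. The abstract does not spell out their definitions, so the standard ones are assumed here, with arteries taken as the positive class and veins as the negative class (an assumption). The counts passed to `av_metrics` are hypothetical and serve only to exercise the function.

```python
def av_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and balanced accuracy.

    Assumption: arteries are the positive class and veins the negative class,
    counted over the vessel pixels being evaluated.
    """
    sensitivity = tp / (tp + fn)                        # arteries correctly labeled
    specificity = tn / (tn + fp)                        # veins correctly labeled
    accuracy = (tp + tn) / (tp + fn + tn + fp)          # all pixels correctly labeled
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy, balanced_accuracy

# Hypothetical counts, chosen only to exercise the function.
sens, spec, acc, bacc = av_metrics(tp=80, fn=20, tn=90, fp=10)

# Consistency check on the reported DRIVE figures: the mean of the reported
# sensitivity and specificity reproduces the reported balanced accuracy.
reported_bacc = (88.39 + 94.25) / 2
```

Under this definition, the reported sensitivity of 88.39% and specificity of 94.25% average to 91.32%, which agrees with the reported balanced accuracy.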
On the WIDE dataset, the proposed method achieves an accuracy of 92.38%, which is 6.18% higher than that of the state-of-the-art method. Conclusion The semantic fusion module makes effective use of multiscale features and automatically selects the more important features, thereby improving performance. The proposed method performs well in A/V classification, exceeding most existing methods.
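The multilabel objective described in the Method part, a focal loss for arteries plus a focal loss for veins, can be sketched as below. This is a minimal illustration, not the authors' code: the convention that α weights positive pixels and 1 − α weights negative pixels follows the usual focal loss formulation and is assumed here, averaging over pixels is an assumption, and α = 0.25, γ = 2.0 are common defaults from the original focal loss work rather than the cross-validated values used in this paper, which the abstract does not report.

```python
import math

def focal_loss(p, y, alpha, gamma):
    """Binary focal loss for one pixel: predicted probability p, label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight (assumed convention)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def av_objective(pixels, alpha, gamma):
    """Sum of the artery and vein focal losses, averaged over pixels (assumed).

    pixels: list of (p_artery, p_vein, y_artery, y_vein). The two labels are
    independent, so a crossing pixel can carry y_artery = y_vein = 1.
    """
    artery = sum(focal_loss(pa, ya, alpha, gamma) for pa, _, ya, _ in pixels)
    vein = sum(focal_loss(pv, yv, alpha, gamma) for _, pv, _, yv in pixels)
    return (artery + vein) / len(pixels)

# Hypothetical predictions for four pixels.
pixels = [
    (0.90, 0.10, 1, 0),  # artery pixel, confidently correct (easy)
    (0.20, 0.60, 0, 1),  # vein pixel, weakly correct (harder)
    (0.70, 0.55, 1, 1),  # crossing: both labels are positive
    (0.05, 0.05, 0, 0),  # background
]
total = av_objective(pixels, alpha=0.25, gamma=2.0)
```

Setting γ = 0 reduces each term to an α-weighted cross-entropy; raising γ shrinks the loss of confidently correct pixels much more than that of uncertain ones, which is the emphasis on hard samples that the abstract describes.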
Keywords