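Data-adaptive SAR image ship detection model with bidirectional feature fusion
Abstract
Objective Ship detection in synthetic aperture radar (SAR) images is an important means of maritime surveillance. Deep learning-based detection models have achieved great success in object detection for natural images, but because natural and SAR images differ, these models cannot be transferred directly to SAR image object detection. To meet the speed and precision requirements of practical SAR object detection, and drawing on the classic single-stage single shot detector (SSD) framework, a lightweight SAR ship detection network based on feature optimization is proposed. Method The model is modified and the network structure is simplified. A data-driven target distribution clustering algorithm is proposed to learn the distributions of target scale and aspect ratio in the SAR dataset, which are then used to set the network parameters. The features extracted by the convolutional neural network (CNN) are optimized through a bidirectional high- and low-level feature fusion mechanism: the semantic information of high-level features is added to low-level features through a semantic aggregation module, and a feature average map is extracted from the low-level features and, after processing, used as an attention weight map that weights the high-level features pixel by pixel, so that the rich spatial information of the low-level features is merged into the high-level features. Result Experiments are conducted on the public SAR ship detection dataset (SSDD). Compared with the original SSD, the lightweight design takes only 65% of the SSD's per-sample testing time without loss of detection precision. The bidirectional feature fusion mechanism raises the average precision (AP) from 77.93% to 80.13%, with training and testing times of 64.1% and 72.6% of those of the SSD, respectively. Compared with published deep learning-based SAR ship detection methods, the proposed method achieves the best performance in both speed and precision: its AP is 1.23% higher than that of the model with the second-best precision, and its training and testing times are shorter by 559.34 ms and 175.35 ms, respectively. Conclusion The experiments verify the effectiveness of the proposed model, which combines advantages in detection speed and precision and is highly practical.
Keywords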
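The abstract does not name the clustering procedure, so the following is only a minimal sketch of how target scale and aspect ratio statistics could be learned from annotated boxes. It assumes plain 1-D k-means (Lloyd's algorithm) with quantile initialization; the function names `kmeans_1d` and `cluster_box_statistics` and the sample boxes are hypothetical and are not taken from the paper.

```python
import math

def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means; an assumed stand-in for the paper's clustering step."""
    ordered = sorted(values)
    # Initialise centres at evenly spaced quantiles so the result is deterministic.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Keep the previous centre if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)

def cluster_box_statistics(boxes, n_scales, n_ratios):
    """Cluster box scale sqrt(w*h) and aspect ratio w/h separately."""
    scales = [math.sqrt(w * h) for w, h in boxes]
    ratios = [w / h for w, h in boxes]
    return kmeans_1d(scales, n_scales), kmeans_1d(ratios, n_ratios)

# Hypothetical (width, height) annotations in pixels, for illustration only.
boxes = [(10, 10), (12, 12), (40, 20), (44, 22), (20, 40), (22, 44)]
scale_centres, ratio_centres = cluster_box_statistics(boxes, n_scales=2, n_ratios=3)
print(scale_centres, ratio_centres)
```

In an SSD-style detector, cluster centres of this kind could replace the hand-picked default box scales and aspect ratios, which is one plausible reading of "used to set the network parameters".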
Data-adaptive single-shot ship detector with a bidirectional feature fusion module for SAR images
Zhang Xiaohan1, Yao Libo1, Lyu Yafei1, Jian Tao1, Zhao Zhiwei2, Zang Jie2
(1. Information Fusion Institute, Naval Aviation University, Yantai 264001, China; 2. Beijing Space Vehicle General Design Department, Beijing 100089, China)
Abstract
Objective Ship detection plays an important role in civil and military fields, including marine object identification, maritime transportation, rescue operations, marine security, and disaster relief. As a basic means of marine monitoring, ship detection in synthetic aperture radar (SAR) images has been studied for years. With the development of sensor and platform technologies, large volumes of SAR data have become available, making automatic data-driven detection algorithms feasible. Deep learning-based detection models have proven highly successful in common object detection tasks for natural scene images and outperform many traditional methods based on handcrafted features. However, when they are transferred directly to SAR ship detection, many challenges emerge and the results are unsatisfactory, because natural and SAR images differ in several respects. Because of the coherent imaging mechanism, ships in SAR images usually appear as bright regions and lack the detail found in natural images. The swath of SAR remote sensing images is large, and targets may be distributed densely or sparsely; thus, the processing of SAR images is usually more complex than that of natural ones. In addition, the size and shape of ship targets vary, ranging from several pixels to hundreds of pixels. All these factors complicate ship detection in SAR images. To address these challenges and meet practical demands, this study proposes a lightweight data-adaptive detector with a feature-optimizing mechanism, built on the well-known single-shot detector (SSD), to improve detection precision and speed. Method In this study, the original SSD is modified by halving the number of channels and removing the last two convolution blocks.
The network parameters are set according to the outputs of the proposed data-driven target distribution clustering algorithm, which learns the distributions of ship size and aspect ratio in the SAR dataset. The algorithm does not rely on human experience and adapts the detector to the SAR dataset. A truncated visual geometry group 16-layer net (VGG16) is used to extract features from input SAR images. The features extracted by convolutional neural networks are hierarchical: low-level features with high spatial resolution usually contain more local and spatial detail, whereas high-level features with low resolution carry more semantic and global information. For object detection tasks, both spatial and semantic information are important. Thus, the two kinds of information must be aggregated through a fusion strategy. A new bidirectional feature fusion mechanism, which contains a semantic aggregation module and a novel attention guidance module, is proposed. In feature pyramid networks, the higher-level features are added to the lower-level features after an upsampling operation. Building on this, the upsampled higher-level features in our model are concatenated with the lower-level features along the channel dimension, and the number of channels is adjusted through a 1×1 convolution. Instead of simply adding lower-level features to higher-level features, an inverse, bottom-up fusion with an attention mechanism is applied. A spatial attention map is generated for each convolution block, and the attention map that contains the most spatial information is selected as the weighted map. In the weighted map, target pixels usually have higher values and are more noticeable, whereas the values of background pixels are suppressed. After the weighted map is downsampled to the resolution of the higher-level features, element-wise multiplication is performed between the weighted map and the higher-level features. The features of the targets are strengthened; thus, spatial information is passed to the higher-level features.
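The two fusion directions described above can be illustrated with a small dependency-free sketch on nested lists shaped [channel][row][column]. It is not the authors' implementation. The sigmoid used to turn the channel-mean map into weights, the 2×2 average pooling used for downsampling, the nearest-neighbour upsampling, the fixed 1×1 weights, and all function names are assumptions made for illustration.

```python
import math

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    return [[[v for v in row for _ in range(2)] for row in ch for _ in range(2)]
            for ch in feat]

def conv1x1(feat, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wo[i] * feat[i][y][x] for i in range(len(feat)))
              for x in range(w)] for y in range(h)] for wo in weights]

def semantic_aggregation(low, high, weights):
    """Top-down path: upsample high-level features, concatenate on channels, 1x1 conv."""
    return conv1x1(low + upsample2x(high), weights)

def attention_guidance(low, high):
    """Bottom-up path: channel-mean map -> sigmoid -> 2x2 average pool -> reweight high."""
    c, h, w = len(low), len(low[0]), len(low[0][0])
    mean_map = [[sum(low[k][y][x] for k in range(c)) / c for x in range(w)]
                for y in range(h)]
    attn = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in mean_map]
    pooled = [[(attn[2 * y][2 * x] + attn[2 * y][2 * x + 1]
                + attn[2 * y + 1][2 * x] + attn[2 * y + 1][2 * x + 1]) / 4.0
               for x in range(w // 2)] for y in range(h // 2)]
    return [[[ch[y][x] * pooled[y][x] for x in range(w // 2)]
             for y in range(h // 2)] for ch in high]

# Toy input: a bright "ship" in the top-left corner of a 2-channel 4x4 low-level map.
low = [[[4.0 if (y < 2 and x < 2) else 0.0 for x in range(4)] for y in range(4)]
       for _ in range(2)]
high = [[[1.0, 1.0], [1.0, 1.0]]]              # 1-channel 2x2 high-level map
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]   # hypothetical 1x1 conv weights (2 out, 3 in)

fused_low = semantic_aggregation(low, high, weights)
fused_high = attention_guidance(low, high)
print(fused_high)
```

In this toy case the high-level cell that covers the bright region keeps almost all of its response while the background cells are halved, which mirrors the claim that target features are strengthened and background is suppressed.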
The optimized features are then fed into the detector heads to predict the locations and classes of targets; the low-level features mainly detect small ships, whereas the high-level features are responsible for large ones. The entire network is trained with a weighted sum of localization and classification losses. During inference, non-maximum suppression is used to remove duplicate bounding boxes. Result The public SAR ship detection dataset (SSDD), widely used in the SAR ship detection literature, is adopted in the experiments. All experiments are implemented in Python with the TensorFlow framework on a 64-bit computer running Ubuntu 16.06, with an Intel(R) Core(TM) i7-6770K CPU @ 4.00 GHz × 8 and an NVIDIA GTX 1080Ti GPU, using CUDA 9.0 and cuDNN 7.0 for acceleration. The number of training iterations, initial learning rate, and batch size are set to 120k, 0.0001, and 24, respectively. A momentum optimizer is used, with weight decay, gamma, and momentum values of 0.0005, 0.1, and 0.9, respectively. An ablation study is conducted to verify the effectiveness of each proposed module, and the model is compared with five published state-of-the-art methods. Precision, recall, average precision (AP), and the average training and testing times on a single image are taken as evaluation indicators. On the original SSD, the model with parameters from the proposed data-driven target distribution clustering algorithm improves the AP by 1.08% in comparison with the model with the original parameters. The lightweight design of the network significantly improves the detection speed; compared with the SSD, the training and testing times of the proposed model decrease from 20.79 ms to 12.74 ms and from 14.02 ms to 9.17 ms, respectively. The semantic aggregation and attention fusion modules each improve detection precision, and the best detection precision is achieved when the two modules are used together.
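The non-maximum suppression step used at inference can be sketched as the standard greedy procedure below. The IoU threshold of 0.5, the corner box format (x1, y1, x2, y2), and the sample detections are assumptions; the abstract does not state the settings actually used.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical detections: the first two boxes cover the same ship.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive suppression
```

Here the second box overlaps the first with an IoU of about 0.82 and is suppressed, so only the highest-scoring box per ship remains.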
The AP increases from 77.93% to 80.13%, and the precision and recall rates increase from 89.54% to 96.68% and from 88.60% to 89.60%, respectively. Speed is not considerably affected, and the model still runs faster than the SSD. The proposed model outperforms the other models in terms of precision and speed; it improves the AP by 6.9%, 1.23%, 9.09%, and 2.9% in comparison with four other methods. Conclusion In this study, we propose a lightweight data-adaptive single-shot detector with a feature-optimizing mechanism. Experimental results show that our model has remarkable advantages over other published state-of-the-art detection approaches in terms of precision and speed.
Keywords