Remote sensing image detection guided by context information and multi-scale feature sequences
Xiao Zhenjiu1, Li Shibo1, Qu Haicheng1, Li Fukun2 (1. Liaoning Technical University; 2. Henan Normal University) Abstract
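Objective In remote sensing image (RSI) detection, targets are small, densely distributed, and vary widely in scale, so missed and false detections occur easily, especially against complex backgrounds. To address this, a remote sensing image detection method guided by context information and multi-scale feature sequences is proposed to improve detection accuracy. Method First, an adaptive large receptive field (ALRF) mechanism is designed for feature extraction. It extracts hierarchical features by cascading depthwise convolutions with different dilation rates, then applies channel and spatial attention to weight the extracted features by channel and fuse them spatially, so that the model can adaptively adjust its receptive field size and make effective use of the contextual information in remote sensing images. Second, to address the loss of small-target semantic information during feature fusion in the neck network, a multi-scale feature fusion (MFF) architecture is designed. By constructing multi-scale feature sequences and incorporating shallow semantic feature information, it fuses multi-scale global information effectively under complex backgrounds, reducing the impact of feature ambiguity in deep networks on capturing the local details of small targets. Finally, because the traditional intersection over union (IoU) is overly sensitive to positional deviations of small targets, the normalized Wasserstein distance (NWD) is introduced. NWD models bounding boxes as two-dimensional Gaussian distributions and computes the Wasserstein distance between these distributions to measure bounding-box similarity, thereby reducing sensitivity to positional deviations of small targets. Result The method is compared comprehensively with 10 methods on the NWPU VHR-10 (Northwestern Polytechnical University very high resolution 10 dataset) and DIOR (dataset for object detection in aerial images) datasets. The results show that the proposed method outperforms the compared methods: relative to the baseline model YOLOv8n, the average precision (AP) reaches 93.15% and 80.89%, respectively, improvements of 5.48% and 2.97%, while the number of parameters decreases by 6.96%. Conclusion A remote sensing image detection method guided by context information and multi-scale feature sequences is proposed; it improves target localization and reduces missed and false detections in remote sensing image detection under complex backgrounds.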
Keywords
Remote sensing image detection guided by context information and multi-scale feature sequences
(Liaoning Technical University) Abstract
Objective Remote sensing images (RSI) are widely used in environmental monitoring, resource management, and emergency response because of their wide coverage, rich information content, and high temporal and spatial resolution. In practice, however, objects in remote sensing images are usually small, densely distributed, and vary widely in scale, which easily leads to missed and false detections, especially against complex backgrounds. To address these challenges, we propose a remote sensing image detection method guided by contextual information and multi-scale feature sequences. Method First, for the feature extraction stage, we design an adaptive large receptive field (ALRF) mechanism. A well-designed receptive field (RF) is crucial in remote sensing image processing: a large receptive field not only captures the global structure of large targets but also uses contextual information to capture the details of small targets, thereby improving detection performance. Large-kernel convolution is a common way to expand the receptive field, but it is computationally expensive: as the kernel size grows, the number of parameters increases rapidly, which reduces the inference speed and efficiency of the model. To solve this problem, ALRF uses multi-layer cascaded dilated convolutions, expanding the receptive field by gradually increasing the kernel size and dilation rate. This keeps the parameter count low while achieving a receptive field and context-awareness comparable to those of a large-kernel convolution.
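The trade-off between cascaded dilated convolutions and a single large kernel can be illustrated with a little arithmetic. The sketch below is not the ALRF implementation; it only computes the receptive field and depthwise parameter count for stride-1 layers. The kernel sizes, dilation rates, and channel count are hypothetical values chosen for illustration, since the abstract does not state the configuration actually used.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) tuples, applied in order.
    Each layer widens the receptive field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def depthwise_params(layers, channels):
    """Weight count of depthwise convolutions (one k x k filter per channel, no bias)."""
    return sum(channels * kernel_size * kernel_size for kernel_size, _ in layers)


# Hypothetical cascade: a 5x5 kernel with dilation 1, then a 7x7 kernel with dilation 3.
cascade = [(5, 1), (7, 3)]
channels = 64  # hypothetical channel count

rf = receptive_field(cascade)
cascade_weights = depthwise_params(cascade, channels)
# A single dense depthwise kernel covering the same receptive field.
single_weights = depthwise_params([(rf, 1)], channels)

print(f"receptive field: {rf}x{rf}")
print(f"cascade weights: {cascade_weights}, single large kernel weights: {single_weights}")
```

Under these assumed settings the cascade covers a 23x23 receptive field with 4736 depthwise weights, whereas a single 23x23 depthwise kernel would need 33856, which is the kind of saving the paragraph above describes.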
For the hierarchical features extracted by the cascaded dilated convolutions, ALRF performs channel weighting through channel attention and then spatial fusion through spatial attention. It thereby adapts the receptive field size to the features of different targets, captures the contextual information in remote sensing images effectively, and extracts multi-scale features. Second, for the feature fusion stage, we propose a multi-scale feature fusion (MFF) architecture. MFF consists mainly of a focal scale fusion engine (FSFE) and a fine-scale feature encoder (FFE), and it fuses the multi-scale feature sequences extracted from the backbone network through a bidirectional feature pyramid network (BiFPN) structure. FSFE constructs a multi-scale feature sequence using Gaussian convolution kernels with increasing standard deviations. FFE, in turn, introduces detailed information from the shallow P2 feature layer and uses hybrid coding to enhance semantic expression. Through the cooperation of FSFE and FFE, MFF fuses multi-scale global information effectively under complex backgrounds, reducing the impact of feature ambiguity in deep networks on capturing small-target details. Finally, we introduce the normalized Wasserstein distance (NWD) to replace the traditional intersection over union (IoU). NWD measures the similarity of bounding boxes by modeling them as two-dimensional Gaussian distributions and calculating the Wasserstein distance between these distributions. As a more robust similarity metric, NWD evaluates the alignment between the bounding box of a tiny object and the ground-truth box more accurately, thereby reducing sensitivity to positional deviations.
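To make the contrast between IoU and NWD concrete, the following minimal sketch follows the commonly cited NWD formulation: a box (cx, cy, w, h) is modeled as a Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4), and the squared 2-Wasserstein distance between two such Gaussians reduces to a squared Euclidean distance between the vectors (cx, cy, w/2, h/2). The normalization constant `c` and the example boxes are assumptions made for illustration; the abstract does not give the constant or the exact variant used in the paper.

```python
import math


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two boxes given as (cx, cy, w, h).

    c is a dataset-dependent normalization constant; 12.8 is a hypothetical default.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two box Gaussians.
    w2_squared = (
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2) ** 2
        + ((ha - hb) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)


def iou(box_a, box_b):
    """Intersection over union for boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


# A hypothetical 6x6 ground-truth box and a prediction shifted by one pixel in x and y.
ground_truth = (10.0, 10.0, 6.0, 6.0)
prediction = (11.0, 11.0, 6.0, 6.0)
print(f"IoU: {iou(ground_truth, prediction):.3f}, NWD: {nwd(ground_truth, prediction):.3f}")

# Non-overlapping boxes: IoU is exactly 0, while NWD still decays smoothly with distance.
far_prediction = (30.0, 30.0, 6.0, 6.0)
print(f"IoU: {iou(ground_truth, far_prediction):.3f}, NWD: {nwd(ground_truth, far_prediction):.3f}")
```

For the tiny box in this example, a one-pixel shift cuts IoU to roughly one half while NWD stays close to 1, and for non-overlapping boxes NWD still provides a graded similarity where IoU gives none.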
Result Our method is evaluated on two datasets, NWPU VHR-10 (Northwestern Polytechnical University very high resolution 10 dataset) and DIOR (dataset for object detection in aerial images), and compared with ten detection methods from the two-stage, one-stage, and DETR families. The experimental results show that the method has clear advantages in small-target detection and in complex backgrounds. On the NWPU VHR-10 dataset, the average precision (AP) of the method reaches 93.15%, 5.48% higher than that of the baseline model YOLOv8n, while the number of parameters is reduced by 6.96%. On the DIOR dataset, the AP of the method is 80.89%, 2.97% higher than that of YOLOv8n. Ablation experiments verify the effectiveness of ALRF, MFF, and NWD. ALRF uses contextual information efficiently through cascaded dilated convolutions; MFF enhances the ability to capture small-target details by constructing multi-scale feature sequences and introducing shallow features; NWD improves the balance of the positive and negative sample distribution by modeling bounding boxes as two-dimensional Gaussian distributions. Using ALRF, MFF, and NWD together gives the best results, with precision, recall, and AP increased by 2.76%, 4.61%, and 5.48%, respectively, compared with the baseline model. In addition, the precision-recall (PR) curve of the method lies closer to the upper-right corner and is more stable, indicating consistent detection performance across target categories. The method also reduces false and missed detections and shows stronger generalization and robustness than the compared methods. Conclusion A remote sensing image detection method guided by context information and multi-scale feature sequences is proposed to address dense small targets, large variations in target size, and complex backgrounds in remote sensing image detection.
The method first uses the adaptive large receptive field (ALRF) to make effective use of context information. It then uses multi-scale feature fusion (MFF) to reduce the loss of semantic information for small targets. Finally, it introduces the normalized Wasserstein distance (NWD) in place of the traditional IoU to optimize the distribution of positive and negative samples. Experimental results show that on the NWPU VHR-10 and DIOR datasets the average precision (AP) of the method reaches 93.15% and 80.89%, respectively, with only 2.94 M model parameters, significantly outperforming the compared methods. In future work, the method will be extended to target detection under multi-view imaging conditions, and a lightweight model suitable for real-time detection applications will be developed.
Keywords
remote sensing images; target detection; receptive field; feature fusion; normalized Wasserstein distance