Distortion semantic aggregation network for salient object detection in 360-degree omnidirectional images
Chen Xiaolei, Zhang Xuegong, Du Zelong, Wang Xing (Lanzhou University of Technology)
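Abstract: Objective To address the challenges posed by the geometric distortion and large field of view of 360° omnidirectional images, this paper proposes a distortion semantic aggregation network (DSANet) that improves salient object detection in 360° omnidirectional images. Methods DSANet consists of three modules: a distortion aware calibration module (DACM), a multiscale semantic attention aggregation module (MSAAM), and a progressive refinement module (PRM). The DACM uses deformable convolutions with different dilation rates to learn an adaptive weight matrix that corrects the geometric distortion in 360° omnidirectional images. The MSAAM combines attention mechanisms with deformable convolution to extract and fuse global semantic features and local detail features, producing multiscale semantic features. The PRM fuses the multiscale semantic features layer by layer to further improve detection accuracy. Together, the MSAAM and the PRM address the large field of view of 360° omnidirectional images. Results Experiments on two public datasets, 360-SOD and 360-SSOD (1605 images in total), show that DSANet outperforms the other methods on six mainstream evaluation metrics (Max F-measure, Mean F-measure, MAE (mean absolute error), Max E-measure, Mean E-measure, and Structure-measure). Conclusion The proposed method performs strongly on multiple objective evaluation metrics, and the saliency maps it generates are clearer in edge contours and spatial structural detail.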
Keywords
Distortion semantic aggregation network for salient object detection in 360 degree omnidirectional images
Chen Xiaolei, Zhang Xuegong, Du Zelong, Wang Xing (Lanzhou University of Technology)
Abstract: Objective By integrating corrected images with spatial and semantic information, salient object detection on omnidirectional image data significantly outperforms single-modality detection in prediction accuracy. Deep learning has further accelerated progress in salient object detection for omnidirectional images. However, existing models in this area largely neglect image distortion and the distinct characteristics of the different modalities. They commonly rely on simple fusion operations such as element-wise addition, multiplication, or concatenation, which fail to foster meaningful interaction between the modalities of omnidirectional images. As a result, they neither use complementary information effectively nor exploit the potential correlations among the modalities. Addressing this deficiency requires more robust methods that strengthen information interaction across the modalities of omnidirectional images and thereby improve salient object detection. This paper proposes a distortion semantic aggregation network (DSANet). Applied to salient object detection in omnidirectional images for the first time, the method analyzes the correlations between modalities and uses them to optimize fusion and interaction. Methods DSANet consists of three modules: the distortion adaptive correction module (DACM), the multi-scale semantic attention aggregation module (MSAAM), and the progressive refinement module (PRM). First, transforming omnidirectional images into equirectangular projection (ERP) images introduces distortion, which partially degrades detection accuracy. This paper therefore designs the DACM to adaptively correct the distorted features of the input ERP image and improve detection accuracy.
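The ERP distortion described above can be made concrete with a small sketch. It relies only on the standard geometry of the equirectangular projection (each image row is stretched horizontally by 1/cos of its latitude relative to the equator) and is not the DACM itself; the function name `erp_row_stretch` is hypothetical and chosen for illustration.

```python
import math

def erp_row_stretch(height):
    """Horizontal stretch of each ERP row relative to the equator.

    Hypothetical helper for illustration only; it describes the
    projection geometry, not the correction learned by the DACM.
    """
    factors = []
    for row in range(height):
        # Latitude of the row centre, from +pi/2 (top) to -pi/2 (bottom).
        lat = (0.5 - (row + 0.5) / height) * math.pi
        # A circle of latitude has circumference proportional to cos(lat),
        # but ERP spreads it over the full image width.
        factors.append(1.0 / math.cos(lat))
    return factors
```

Rows near the poles receive much larger factors than rows near the equator, which is why a fixed convolution kernel sees objects at inconsistent scales across an ERP image.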
The DACM applies deformable convolutions with different dilation rates to the input ERP image to learn an adaptive weight matrix that corrects the geometric distortion in the 360° omnidirectional image. Second, the corrected ERP image is fed into a ResNet-50 encoder for feature extraction. Convolutional neural networks obtain abstract semantic features through repeated convolution and pooling, but the diversity of input images makes the scale and location of targets highly uncertain, and the theoretical receptive field often differs from the actual receptive field, so the network cannot extract global semantic features effectively. If single-scale deep features are used directly, the correct salient targets may be missed for lack of semantic features, whereas combining global semantic features obtained through channel attention with local detail features obtained through spatial attention can solve the problem. Based on this idea, this paper designs a multi-scale semantic attention aggregation module (MSAAM), which consists mainly of two sub-modules, one for global semantic features and one for local detail features. The features extracted by the encoder are first passed to the channel attention module, where convolution produces channel weights; these weights are multiplied element-wise with the input features to give the global semantic features. The global semantic features are then passed to the spatial attention module to obtain spatial weights, which are multiplied element-wise with the input features to give the local detail features. Finally, the global semantic features and the local detail features are merged to generate the multi-scale semantic features. The PRM fuses the multi-scale semantic features layer by layer to further improve detection accuracy.
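The channel-then-spatial attention flow described for the MSAAM can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1x1 convolution is replaced by a plain matrix `w_channel`, the spatial weights are taken from the channel-averaged global features, the spatial weights are applied to the original encoder features, the merge is an element-wise sum, and the deformable convolution is omitted. The name `attention_aggregate` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_aggregate(x, w_channel):
    """x: (C, H, W) feature map; w_channel: (C, C) stand-in for a 1x1 conv.

    Assumed simplification of the attention flow, for illustration only.
    """
    # Channel attention: pool each channel to one value, mix, squash to (0, 1).
    pooled = x.mean(axis=(1, 2))                   # shape (C,)
    channel_w = sigmoid(w_channel @ pooled)        # shape (C,)
    global_feat = x * channel_w[:, None, None]     # global semantic features

    # Spatial attention: one weight per pixel, derived from the global features.
    spatial_w = sigmoid(global_feat.mean(axis=0))  # shape (H, W)
    local_feat = x * spatial_w[None, :, :]         # local detail features

    # Merge the two branches (assumed here to be an element-wise sum).
    return global_feat + local_feat
```

Broadcasting the `(C,)` and `(H, W)` weight arrays against the `(C, H, W)` features is what makes the two multiplications element-wise per channel and per pixel respectively.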
The MSAAM works together with the PRM to handle the large field of view of 360° omnidirectional images. Finally, because different feature layers conflict during fusion, the saliency map generated by the MSAAM is blurred and internally incomplete. To obtain more accurate saliency maps, this paper designs a progressive refinement module (PRM) in the decoder, which fuses the multi-scale semantic feature maps from the encoder layer by layer, from deep to shallow, refining and enhancing the feature maps at each stage. The PRM up-samples the high-level features and multiplies them element-wise with the low-level features; the product is then multiplied with the high-level and the low-level features respectively, and the results are summed to obtain the coarse-level features. The coarse-level features are activated to obtain weights, which are inverted, multiplied with the coarse-level features, and summed to obtain the fine-level features. The fine-level features are then aggregated from the high level to the low level to produce the output features. Results Experiments on two publicly available datasets, 360-SOD and 360-SSOD (1605 images in total), show that DSANet outperforms the other methods on six mainstream evaluation metrics (Max F-measure, Mean F-measure, MAE (mean absolute error), Max E-measure, Mean E-measure, and Structure-measure). Conclusion The proposed method performs well on several objective evaluation metrics and generates saliency maps with clearer edge contours and spatial structural detail.
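Two of the metrics named above, MAE and F-measure, can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the paper. It assumes saliency maps and ground truth scaled to [0, 1], and it uses beta squared = 0.3, the value conventionally used in salient object detection; whether the paper uses the same setting is an assumption. The function names are hypothetical.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of the saliency map binarized at one threshold.

    beta2 = 0.3 is the usual convention (assumed, not taken from the paper).
    """
    binary = pred >= threshold
    gt_bool = gt > 0.5
    tp = np.logical_and(binary, gt_bool).sum()
    precision = tp / max(binary.sum(), 1)   # guard against empty prediction
    recall = tp / max(gt_bool.sum(), 1)     # guard against empty ground truth
    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / denom)
```

The Max and Mean variants are typically obtained by evaluating `f_measure` over a sweep of thresholds and taking the maximum or the average of the resulting scores.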
Keywords
deep learning; salient object detection; 360° omnidirectional images; geometric distortion; large-scale vision