Saliency detection based on multi-level features and spatial attention

Chen Kai, Wang Yongxiong (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract
Objective Multi-level features play an important role in saliency detection, and the extraction and fusion of multi-level features is one of the important research directions in this field. To address the problems that existing multi-level feature extraction methods neglect feature fusion and transmission and are sensitive to background interference, this paper proposes a saliency detection model that combines spatial attention with multi-level feature fusion, based on the feature pyramid network and the attention mechanism. The model achieves good fusion and transmission of multi-level features with a simple network structure.

Method To improve the quality of feature fusion, a multi-level feature fusion module is designed, which optimizes the fusion and transmission of high-level and low-level features through pooling and convolution at different scales. To reduce interference from background noise in low-level features, a spatial attention module is designed: spatial attention maps are obtained from high-level features via pooling and convolution at different scales, and these maps supplement low-level features with global semantic information, highlighting the foreground of low-level features while suppressing background interference.

Result The proposed method is compared with nine mainstream saliency detection methods on four public datasets: DUTS, DUT-OMRON (Dalian University of Technology and OMRON Corporation), HKU-IS, and ECSSD (extended complex scene saliency dataset). On the DUTS-test dataset, relative to the second-best model, the maximum F-measure (MaxF) of the proposed method increases by 1.04% and the mean absolute error (MAE) decreases by 4.35%; evaluation metrics such as the precision-recall (PR) curve and the structure measure (S-measure) are also better than those of the compared methods. The resulting saliency maps are closer to the ground truth, and the model also runs at a competitive speed.

Conclusion This paper achieves good fusion of multi-level features with a simple network structure. The feature fusion module improves the quality of feature fusion and transmission, and the spatial attention module realizes effective feature selection, highlighting salient regions and reducing interference from background noise. Extensive experiments demonstrate the overall performance of the model and the effectiveness of each module.
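For reference, the two headline metrics cited above can be computed as in the following minimal NumPy sketch. The β² = 0.3 weighting and the 255-threshold sweep are the common convention in the saliency detection literature, not details stated in this abstract.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and its ground truth (both in [0, 1])."""
    return np.abs(sal - gt).mean()

def max_f_measure(sal, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over binarization thresholds (beta^2 = 0.3 by convention)."""
    gt = gt > 0.5                                   # binarize the ground truth
    best = 0.0
    for t in np.linspace(0, 1, num_thresholds, endpoint=False):
        pred = sal > t
        tp = np.logical_and(pred, gt).sum()
        if tp == 0:                                 # no overlap at this threshold
            continue
        precision = tp / pred.sum()
        recall = tp / gt.sum()
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best

# Example: score a random map against a square ground-truth region
sal = np.random.rand(64, 64)
gt = np.zeros((64, 64)); gt[16:48, 16:48] = 1.0
print(mae(sal, gt), max_f_measure(sal, gt))
```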
Keywords
Saliency detection based on multi-level features and spatial attention

Chen Kai, Wang Yongxiong(School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract
Objective In contrast with semantic segmentation and edge detection, saliency detection focuses on finding the most attractive target in an image. Saliency maps can be widely used as a preprocessing step in various computer vision tasks, such as image retrieval, image segmentation, object recognition, object detection, and visual tracking. In computer graphics, a map scan is used in non-photorealistic rendering, automatic image cropping, video summarization, and image retargeting. Early saliency detection methods mostly measure the salient score through basic characteristics, such as color, texture, and contrast. Although considerable progress has been achieved, handcrafted features typically lack global information and tend to highlight the edges of salient targets rather than the overall area to describe complex scenes and structures. Given the development of deep learning, the introduction of convolutional neural networks frees saliency detection from the restraint of traditional handcrafted features and achieves the best results at present. Fully convolutional networks (FCNs) stack convolution and pooling layers to obtain global semantic information. Spatial structure information may be lost and the edge information of saliency targets may be destroyed when we increase the receptive field to obtain global semantic features. Thus, the FCN cannot satisfy the requirement of a complex saliency detection task. To obtain accurate saliency maps, some studies have attempted to introduce handcrafted features to retain the edge of a saliency target and obtain the final saliency maps by combining the extracted edge’s handcrafted features with the higher-level features of the FCN. However, the extraction of handcrafted features takes considerable time. Details may be gradually lost in the process of transforming features from low level to high level. Some studies have achieved good results; they combine high- and low-level features and use low-level features to enrich the details of high-level features. Many models based on multilevel feature fusion have been proposed in recent years, including multi flow, side fusion, bottom-up, and top-down structures. These models focus on network structures and disregard the importance of transmission and the difference between high- and low-level features. This condition may cause the loss of the global semantic information of high-level features and increase the interference of low-level features. Multilevel features play an important role in saliency detection. The method of multilevel feature extraction and fusion is one of the important research directions in saliency detection. To solve the problems of feature fusion and sensitivity to background interference, this study proposes a new saliency detection method based on feature pyramid networks and spatial attention. This method achieves the fusion and transmission of multilevel features with simple network architecture. Method We propose a multilevel feature fusion network architecture based on a feature pyramid network and spatial attention to integrate different levels of features. The proposed architecture adopts the feature pyramid network, which is the classic bottom-up and top-down structure, as the backbone network and focuses on the optimization of multilevel feature fusion and the transmission process. The network proposed in this work consists of two parts. The first part is the bottom-up convolution part, which is used to extract features. The second part is the top-down upsampling part. 
Each upsampling of high-level features will be fused with the low-level features of the corresponding scale and transmitted forward. The feature pyramid network removes the high-resolution feature before the first pooling to reduce computation. Multilevel features are extracted using visual geometry group (VGG)-16, which is one of the most excellent feature extraction networks. To improve the quality of feature fusion, a multilevel feature fusion module that optimizes the fusion and transmission processes of high-level features and various low-level features through the pooling and convolution of different scales is designed. To reduce the background interference of low-level features, a spatial attention module that supplies global semantic information for low-level features through attention maps obtained from high-level features via the pooling and convolution of different scales is designed. These attention maps can assist low-level features to highlight the foreground and suppress the background. Result The experimental results show that the saliency maps obtained using the proposed method are highly similar to the ground truth maps in four standard datasets, namely, DUTS, DUT-OMRON(Dalian University of Technology and OMRON Corporation), HKU-IS, and extended complex scene saliency dataset(ECSSD), the max F-measure MaxF increased by 1.04%, and mean absolute error (MAE) decreased by 4.35% compared with the second in the DUTS-test dataset. The method proposed in this study performs the best in simple or complex scenes. The network exhibits good feature fusion and edge learning abilities, which can effectively suppress the background of salient areas and fuse the details of low-level features. The saliency maps from our method have more complete salient areas and clearer edges. The results in terms of four common evaluation indexes are better than those obtained by nine state-of-the-art methods. Conclusion In this study, the fusion of multilevel features is realized well by using a simple network structure. The multilevel feature fusion module can retain the location information of saliency targets and improve the quality of feature fusion and transmission. The spatial attention module reduces the background details and makes the saliency areas more complete. This module realizes feature selection and avoids the interference of background noise. Many experiments have proven the performance of the model and the effectiveness of each module proposed in this work.
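To make the method paragraph above concrete, here is a minimal PyTorch sketch of the overall pattern the abstract describes: a bottom-up VGG-16 path, attention maps derived from high-level features by pooling and convolution at several scales, and a top-down path that fuses each upsampled level with attention-gated low-level features. The class names, channel width (mid=64), pooling scales, and module internals are illustrative assumptions, not the authors' released design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SpatialAttention(nn.Module):
    """Spatial attention map from high-level features via pooling and
    convolution at several scales (scales and widths are illustrative)."""
    def __init__(self, channels, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1) for _ in scales)

    def forward(self, high, out_size):
        maps = []
        for s, conv in zip(self.scales, self.convs):
            pooled = F.avg_pool2d(high, kernel_size=s, stride=s)  # multi-scale pooling
            maps.append(F.interpolate(conv(pooled), size=out_size,
                                      mode="bilinear", align_corners=False))
        return torch.sigmoid(sum(maps))  # values in (0, 1): foreground weights

class FuseBlock(nn.Module):
    """Fuses upsampled high-level features with attention-gated low-level features."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.attn = SpatialAttention(high_ch)
        self.conv = nn.Conv2d(high_ch + low_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, high, low):
        attn = self.attn(high, low.shape[2:])          # global semantics as a gate
        high = F.interpolate(high, size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        return F.relu(self.conv(torch.cat([high, low * attn], dim=1)))

class SaliencyNet(nn.Module):
    """Bottom-up VGG-16 path plus top-down fusion path."""
    def __init__(self, mid=64):
        super().__init__()
        feats = vgg16(weights=None).features
        # conv2-conv5 outputs serve as pyramid levels; the high-resolution
        # conv1 features (before the first pooling) are not used, as in FPN
        self.stages = nn.ModuleList(
            [feats[:9], feats[9:16], feats[16:23], feats[23:30]])
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, mid, kernel_size=1) for c in (128, 256, 512, 512))
        self.fuse = nn.ModuleList(FuseBlock(mid, mid, mid) for _ in range(3))
        self.head = nn.Conv2d(mid, 1, kernel_size=3, padding=1)

    def forward(self, x):
        feats = []
        for stage in self.stages:                      # bottom-up extraction
            x = stage(x)
            feats.append(x)
        feats = [lat(f) for lat, f in zip(self.lateral, feats)]
        top = feats[-1]
        for fuse, low in zip(self.fuse, reversed(feats[:-1])):  # top-down fusion
            top = fuse(top, low)
        return torch.sigmoid(self.head(top))           # map at 1/2 input resolution
```

A forward pass on a dummy batch, e.g. SaliencyNet()(torch.rand(1, 3, 224, 224)), yields a 112×112 saliency map, which would then be upsampled to the input size for evaluation.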
Keywords
