Multi-scale convolutional neural network for salient object detection

Zhang Qing, Zuo Baochuan, Shi Yanjiao, Dai Meng (School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China)

Abstract
Objective Most traditional saliency detection models rely on handcrafted low- and mid-level features and prior information to detect objects, and their precision and recall are low. With the rise of deep convolutional neural networks, saliency detection has developed rapidly. However, existing saliency methods still share a common drawback: in complex images they have difficulty uniformly highlighting the entire object with explicit boundaries and homogeneous interior regions, mainly because they lack sufficiently rich features for detection. Method We improve on the VGG (visual geometry group) model by removing its final fully connected layers and adopting skip connections for pixel-level saliency prediction, which effectively combines multi-scale information from different convolutional layers of the network. The model can also integrate high-level semantic cues and low-level detail in a data-driven framework. To preserve object boundaries and keep interior regions uniform, a fully connected conditional random field (CRF) model is applied to refine the estimated saliency map. Result We evaluate our method on six widely used public datasets, DUT-OMRON (Dalian University of Technology and OMRON Corporation), ECSSD (extended complex scene saliency dataset), SED2 (segmentation evaluation database 2), HKU, PASCAL-S, and SOD (salient objects dataset), and compare it with 14 state-of-the-art, representative methods in terms of precision-recall (PR) curves, F-measure, max F-measure, weighted F-measure, and mean absolute error (MAE). On the six datasets, our method achieves F-measure scores of 0.696, 0.876, 0.797, 0.868, 0.772, and 0.785; max F-measure scores of 0.747, 0.899, 0.859, 0.889, 0.814, and 0.833; weighted F-measure scores of 0.656, 0.854, 0.772, 0.844, 0.732, and 0.762; and MAE scores of 0.074, 0.061, 0.093, 0.049, 0.099, and 0.124. Whether on images whose foreground and background share similar colors or on complex images with multiple objects, our method performs close to the latest research results and outperforms most representative methods. Conclusion The proposed method is robust for saliency detection across a variety of scenes; it yields salient objects with more uniform boundaries and interior regions, and its detection results are more accurate.
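As a rough illustration of the CRF refinement step described in the Method above, the following sketch applies a fully connected CRF to a coarse saliency map, treating saliency as a two-class (background/salient) labeling problem. It assumes the third-party pydensecrf package; the kernel parameters and iteration count are illustrative defaults, not the settings used in the paper.

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, saliency, n_iters=5):
    """Refine a coarse saliency map with a fully connected CRF.

    image: H x W x 3 uint8 RGB array; saliency: H x W float map in [0, 1].
    """
    h, w = saliency.shape
    # Two classes: background (index 0) and salient (index 1).
    probs = np.stack([1.0 - saliency, saliency], axis=0)
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # Smoothness kernel (positions only) and appearance kernel (positions
    # plus color), encouraging label agreement between nearby and
    # similarly colored pixels.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10,
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = np.asarray(d.inference(n_iters)).reshape(2, h, w)
    return q[1]  # refined probability of the "salient" class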
Keywords
A multi-scale convolutional neural network for salient object detection

Zhang Qing, Zuo Baochuan, Shi Yanjiao, Dai Meng(School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China)

Abstract
Objective Salient object detection aims to localize and segment the most conspicuous and eye-attracting objects or regions in an image. Its results are usually expressed as saliency maps, in which the intensity of each pixel represents the probability that the pixel belongs to a salient region. Visual saliency detection has been used as a pre-processing step to facilitate a wide range of vision applications, including image and video compression, image retargeting, visual tracking, and robot navigation. Traditional saliency detection models rely on handcrafted features and prior information for detection, such as background prior, center prior, and contrast prior. However, these models are less applicable to a wide range of problems in practice. For example, salient objects are difficult to recognize when the background and salient objects share similar visual attributes. Moreover, failure may occur when multiple salient objects overlap partly or entirely with one another. With the rise of deep convolutional neural networks (CNNs), visual saliency detection has achieved rapid progress in recent years. CNN-based models overcome the disadvantages of handcrafted-feature-based approaches and greatly enhance the performance of saliency detection. They have shown their superiority in feature extraction and efficiently capture high-level information on objects and their cluttered surroundings, thereby achieving better performance than traditional methods. The emergence of fully convolutional networks (FCNs) has been especially influential, and most mainstream saliency detection algorithms are now based on FCNs. The FCN model unifies the two stages of feature extraction and saliency calculation and optimizes them jointly through supervised learning. As a result, the features extracted by an FCN are more expressive and robust than handcrafted features. However, existing saliency approaches share common drawbacks, such as difficulty in uniformly highlighting entire salient objects with explicit boundaries and homogeneous interior regions in complex images. This drawback is largely due to the lack of sufficiently rich features for detecting salient objects. Method In this study, we propose a simple but efficient CNN for pixel-wise saliency prediction that captures various features simultaneously and utilizes multi-scale information from different convolutional layers of a CNN. To design an FCN-like network capable of pixel-level saliency inference, we develop a multi-scale deep CNN that discovers more information for saliency computation. The multi-scale feature extraction network generates feature maps with different resolutions from the side outputs of the convolutional layer groups of a base network. The shallow convolutional layers contain rich, detailed structure information at the expense of global representation; by contrast, the deep convolutional layers contain rich semantic information but lack spatial detail. The network is thus capable of incorporating high-level semantic cues and low-level detailed information in a data-driven framework. Finally, to efficiently preserve object boundaries and uniform interior regions, we adopt a fully connected conditional random field (CRF) model to refine the estimated saliency map.
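The following is a minimal PyTorch sketch of the kind of architecture the Method describes: a VGG-16 backbone with the fully connected layers removed, side outputs taken from each convolutional stage, upsampled to a common resolution, and fused into a single pixel-wise saliency map. The stage boundaries, side-output design, and fusion head are our illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MultiScaleSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # VGG-16 backbone without the fully connected layers; in practice
        # it would be initialized with ImageNet-pretrained weights.
        backbone = vgg16(weights=None).features
        # Split the backbone into its five convolutional stages.
        self.stages = nn.ModuleList([
            backbone[:4], backbone[4:9], backbone[9:16],
            backbone[16:23], backbone[23:30],
        ])
        # One 1 x 1 side output per stage, mapping stage features
        # (64, 128, 256, 512, 512 channels in VGG-16) to a score map.
        self.sides = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in (64, 128, 256, 512, 512)])
        # Fuse the five upsampled side outputs into the final saliency map.
        self.fuse = nn.Conv2d(5, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        side_maps = []
        for stage, side in zip(self.stages, self.sides):
            x = stage(x)
            # Upsample every side output to the input resolution so that
            # shallow detail and deep semantics are combined pixel-wise.
            side_maps.append(F.interpolate(side(x), size=(h, w),
                                           mode="bilinear", align_corners=False))
        return torch.sigmoid(self.fuse(torch.cat(side_maps, dim=1)))

For example, MultiScaleSaliencyNet()(torch.randn(1, 3, 224, 224)) yields a 1 x 1 x 224 x 224 saliency map, which a fully connected CRF can then refine.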
Result Extensive experiments are conducted on six of the most widely used and challenging benchmark datasets, namely, DUT-OMRON (Dalian University of Technology and OMRON Corporation), ECSSD (extended complex scene saliency dataset), SED2 (segmentation evaluation database 2), HKU, PASCAL-S, and SOD (salient objects dataset). The F-measure scores of our proposed scheme on these six benchmark datasets are 0.696, 0.876, 0.797, 0.868, 0.772, and 0.785, respectively. The max F-measure scores are 0.747, 0.899, 0.859, 0.889, 0.814, and 0.833, respectively. The weighted F-measure scores are 0.656, 0.854, 0.772, 0.844, 0.732, and 0.762, respectively. The mean absolute error (MAE) scores are 0.074, 0.061, 0.093, 0.049, 0.099, and 0.124, respectively. We also compare the proposed method with 14 state-of-the-art methods. The results demonstrate the efficiency and robustness of the proposed approach against these 14 methods in terms of popular evaluation metrics. Conclusion We propose an efficient FCN-like salient object detection model that generates rich and effective features. The algorithm is robust for image saliency detection in various scenarios. At the same time, it produces salient objects with uniform boundaries and interior regions, and its detection results are accurate.
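To make the reported numbers concrete, the sketch below shows how the F-measure and MAE scores are conventionally computed from a predicted saliency map and a binary ground-truth mask. The beta^2 = 0.3 weighting and the adaptive threshold of twice the mean saliency value are common conventions in the saliency literature, assumed here rather than quoted from the paper.

import numpy as np

def f_measure(saliency, gt, beta2=0.3):
    """F-measure with an adaptive threshold of twice the mean saliency.

    saliency: H x W float map in [0, 1]; gt: H x W boolean ground-truth mask.
    """
    threshold = min(2.0 * saliency.mean(), 1.0)
    pred = saliency >= threshold
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    # beta2 < 1 weights precision more heavily than recall.
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

def mae(saliency, gt):
    """Mean absolute error between the saliency map and the mask."""
    return np.abs(saliency - gt.astype(np.float64)).mean()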
Keywords
