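RGB-D video saliency detection under a superpixel-level conditional random field
Abstract
Objective Visual saliency plays an important role in many vision-driven applications. These application areas are shifting from 2D to 3D vision, so saliency models based on RGB-D data have attracted wide attention. Unlike saliency in 2D images, RGB-D saliency involves many cues of different modalities. These multimodal cues both complement and compete with one another, and how to exploit and fuse them effectively remains a challenge. Traditional fusion models struggle to make full use of the strengths of multimodal cues, so this work studies the fusion of multimodal cues in the formation of RGB-D saliency. Method An RGB-D saliency detection model based on a superpixel-level conditional random field is proposed. Saliency cues of different modalities are extracted, including planar (spatial) cues, depth cues, and motion cues. A conditional random field model is built with superpixels as its units. By combining the influence of the multimodal cues with a smoothness constraint on the saliency values of neighboring image regions, a global energy function is designed as the optimization objective of the model, which characterizes the interaction among the multimodal cues. The weighting factors of the multimodal cues in the energy function are learned by a convolutional neural network. Result The model is compared with six saliency detection methods on two public RGB-D video saliency datasets, and it outperforms the current state-of-the-art models on all datasets and evaluation metrics considered. Relative to the second-highest scores, the AUC (area under curve), sAUC (shuffled AUC), SIM (similarity), PCC (Pearson correlation coefficient), and NSS (normalized scanpath saliency) of the proposed model improve by 2.3%, 2.3%, 18.9%, 21.6%, and 56.2%, respectively, on the IRCCyN dataset, and by 2.0%, 1.4%, 29.1%, 10.6%, and 23.3%, respectively, on the DML-iTrack-3D dataset. Comparisons within the model further verify that the proposed fusion method outperforms other traditional fusion methods. Conclusion The conditional random field and the convolutional neural network in the proposed RGB-D saliency detection model make full use of the strengths of cues from different modalities and fuse them effectively, which improves the performance of the saliency detection model and can benefit vision-driven applications.
Keywords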
RGB-D video saliency detection via superpixel-level conditional random field
Li Bei, Yang You, Liu Qiong (School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China)
Abstract
Objective Visual saliency detection aims to identify the most attractive objects or regions in an image and plays a fundamental role in many vision-based applications, such as target detection and tracking, visual content analysis, scene classification, image/video compression, image quality evaluation, and pedestrian detection. In recent years, the paradigm shift from 2D to 3D vision has enabled many interesting functionalities in these applications, but traditional RGB saliency detection models cannot produce satisfactory results in them. Thus, visual saliency detection models based on RGB-D data, which involve visual cues of different modalities, have attracted considerable research interest. Existing RGB-D saliency detection models usually consist of two stages. In the first stage, multimodality visual cues, including spatial, depth, and motion cues, are extracted from the color map and the depth map. In the second stage, these cues are fused to obtain the final saliency map via various fusion methods, such as linear weighted summation, the Bayesian framework, and the conditional random field (CRF). Learning-based fusion methods, such as support vector machines, AdaBoost, random forests, and deep neural networks, have also been widely studied in recent years. Several of these fusion methods have achieved good results in RGB saliency models. However, unlike in traditional RGB saliency detection, the multimodality visual cues involved in RGB-D saliency, and especially their saliency results, in most cases differ substantially from one another. This difference reveals the rivalry among multimodality saliency cues and makes the fusion stage of RGB-D saliency models difficult.
Therefore, under the two-stage framework, a new challenge arises: how to design suitable features for the saliency maps of the corresponding multimodality visual cues so that they are more likely to fuse well in the first stage, and how to fuse these saliency maps into the final RGB-D visual saliency map in the second stage. Method An RGB-D saliency detection model is proposed based on a superpixel-level CRF, with 3D scenes represented as videos of RGB maps and their corresponding depth maps. The predicted saliency map is obtained in two stages: extraction of multimodality saliency cues and final fusion. Multimodality saliency cues, including spatial, depth, and motion cues, are considered, and an independent saliency map is computed for each of the three. A saliency fusion algorithm is then proposed based on the superpixel-level CRF model. The graph structure of the CRF model is constructed by taking the superpixels as graph nodes and connecting each superpixel to its adjacent superpixels. On this graph, a global energy function is designed that jointly considers the influence of the multimodality saliency cues and the smoothing constraint between neighboring superpixels. The global energy function consists of a data term and a smooth term. The data term describes the effects of the multimodality saliency maps on the final fused saliency map. Because the multimodality saliency maps play different roles in different scenarios, three weighting maps for the multimodality saliency cues are learned via a convolutional neural network (CNN) and added to the data term. The smooth term constrains the difference between the saliency values of adjacent superpixels, and the strength of the constraint is controlled by the RGB and depth differences between them: the smaller the differences between the RGB and depth vectors of two adjacent superpixels, the more likely these two superpixels are to have similar saliency values.
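The fusion step described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the energy function, so the code assumes a quadratic energy, E(s) = Σ_i Σ_k w_{k,i} (s_i − m_{k,i})² + λ Σ_{(i,j)} a_{ij} (s_i − s_j)², where m_k are the cue saliency maps averaged per superpixel, w_k are their weights, and a_{ij} decays with the RGB and depth differences of adjacent superpixels. The function names `adjacent_pairs` and `fuse_saliency` and the parameters `lam`, `sigma_c`, and `sigma_d` are hypothetical, 4-connectivity is an assumption, and the weights are simply passed in rather than learned by a CNN.

```python
import numpy as np

def adjacent_pairs(labels):
    """Return sorted (i, j) pairs of superpixel ids that touch (4-connectivity, assumed)."""
    labels = np.asarray(labels)
    pairs = set()
    # Compare each pixel with its right neighbour, then with its lower neighbour.
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        diff = a != b
        for i, j in zip(a[diff].tolist(), b[diff].tolist()):
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

def fuse_saliency(cues, weights, edges, color, depth,
                  lam=1.0, sigma_c=0.1, sigma_d=0.1):
    """Minimise the assumed quadratic energy in closed form.

    cues, weights: arrays of shape (K, N), one row per cue, one column per superpixel.
    edges: list of (i, j) adjacent superpixel pairs.
    color: (N, 3) mean RGB per superpixel; depth: (N,) mean depth per superpixel.
    Every connected component needs at least one positive weight so the system is solvable.
    """
    cues = np.asarray(cues, dtype=float)
    weights = np.asarray(weights, dtype=float)
    color = np.asarray(color, dtype=float)
    depth = np.asarray(depth, dtype=float)
    n = cues.shape[1]
    lap = np.zeros((n, n))  # weighted graph Laplacian of the superpixel graph
    for i, j in edges:
        dc = np.sum((color[i] - color[j]) ** 2) / sigma_c ** 2
        dd = (depth[i] - depth[j]) ** 2 / sigma_d ** 2
        a = np.exp(-(dc + dd))  # similar RGB and depth give stronger smoothing
        lap[i, i] += a
        lap[j, j] += a
        lap[i, j] -= a
        lap[j, i] -= a
    # Setting the gradient of the energy to zero gives (W + lam * L) s = sum_k w_k * m_k.
    lhs = np.diag(weights.sum(axis=0)) + lam * lap
    rhs = (weights * cues).sum(axis=0)
    return np.linalg.solve(lhs, rhs)
```

With `lam=0` the result reduces to a per-superpixel weighted average of the cue maps; increasing `lam` pulls adjacent superpixels with similar color and depth toward a common saliency value. A quadratic energy is chosen here only because it has a closed-form minimizer; the model in the paper may use a different energy and optimizer.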
The final predicted saliency map is obtained by optimizing the global energy function. Result In the experiments, the proposed model is compared with six state-of-the-art saliency detection models on two public RGB-D video saliency datasets, namely, IRCCyN and DML-iTrack-3D. Five popular quantitative metrics are used to evaluate the proposed model: area under curve (AUC), shuffled AUC (sAUC), similarity (SIM), Pearson correlation coefficient (PCC), and normalized scanpath saliency (NSS). Experimental results show that the proposed model outperforms the state-of-the-art models on all involved datasets and evaluation metrics. Compared with the second-highest scores, the AUC, sAUC, SIM, PCC, and NSS of our model increase by 2.3%, 2.3%, 18.9%, 21.6%, and 56.2%, respectively, on the IRCCyN dataset, and by 2.0%, 1.4%, 29.1%, 10.6%, and 23.3%, respectively, on the DML-iTrack-3D dataset. Moreover, comparisons with the saliency maps of the individual visual cues and with traditional fusion methods show that the proposed model achieves the best performance and that the proposed fusion method effectively takes advantage of the different visual cues. To verify the benefit of the proposed CNN-based weight-learning network, the weights of the multimodality saliency maps are set to the same value; the results show that performance decreases once the weight-learning network is removed in this way. Conclusion In this study, an RGB-D saliency detection model based on a superpixel-level CRF is proposed. The multimodality visual cues are first extracted and then fused by a CRF model with a global energy function. The fusion stage jointly considers the effects of the multimodality visual cues and the smoothing constraint on the saliency values of adjacent superpixels. Therefore, the proposed model makes full use of the advantages of the multimodality visual cues and avoids the conflict caused by the competition among them, thus achieving better fusion results.
The experimental results show that the proposed model scores higher than other state-of-the-art models on all five evaluation metrics on two RGB-D video saliency datasets. Thus, the proposed model can use the correlation among multimodality visual cues to detect salient objects or regions in 3D dynamic scenes effectively, which should be helpful for 3D vision-based applications. In addition, the proposed model is a simple, intuitive combination of a traditional method and a deep learning method, and this combination leaves considerable room for improvement. Future work will focus on how to combine traditional methods and deep learning methods more effectively.
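For readers unfamiliar with the evaluation metrics, three of the five (SIM, PCC, and NSS) can be sketched in a few lines using their commonly used definitions. This is an illustrative sketch only: the function names `nss`, `pcc`, and `sim` are hypothetical, and the exact preprocessing used in the paper's evaluation (resizing, blurring of fixations, handling of constant maps) is not specified in the abstract and is not reproduced here.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized scanpath saliency: mean z-scored saliency at fixated pixels."""
    saliency = np.asarray(saliency, dtype=float)
    fixations = np.asarray(fixations, dtype=bool)
    z = (saliency - saliency.mean()) / saliency.std()
    return z[fixations].mean()

def pcc(saliency, density):
    """Pearson correlation coefficient between a saliency map and a fixation density map."""
    s = np.asarray(saliency, dtype=float).ravel()
    d = np.asarray(density, dtype=float).ravel()
    return np.corrcoef(s, d)[0, 1]

def sim(saliency, density):
    """Similarity: histogram intersection of the two maps, each normalized to sum to 1."""
    p = np.asarray(saliency, dtype=float)
    q = np.asarray(density, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return np.minimum(p, q).sum()
```

All three functions assume non-constant, non-negative input maps of the same shape; `sim` and `pcc` compare against a continuous fixation density map, whereas `nss` takes a binary fixation mask.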
Keywords
RGB-D saliency; saliency fusion; conditional random field (CRF); global energy function; convolutional neural network (CNN)