Optimization of traditional saliency detection by sparse depth features
Abstract
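Objective Salient object detection algorithms fall mainly into traditional methods based on low-level features and newer methods based on deep learning. Traditional methods struggle to capture the high-level semantic information of objects, whereas deep learning methods capture high-level semantic information but neglect edge features. To exploit the strengths of both, and following the idea of combining them, this paper uses the ability of sparsity to concentrate responses on salient objects and proposes a method based on sparse self-encoding and saliency result optimization. Method The feature maps of the fourth pooling layer of the VGG (visual geometry group) network are processed by sparse self-encoding to obtain five sparse saliency feature maps, which are then fed, together with the saliency map produced by a traditional method, into a convolutional neural network that optimizes the saliency result. Result Experiments with the traditional methods DRFI (discriminative regional feature integration), HDCT (high dimensional color transform), RRWR (regularized random walks ranking), and CGVS (contour-guided visual search) on the public datasets DUT-OMRON, ECSSD, HKU-IS, and MSRA show that the proposed algorithm effectively improves the F value and MAE (mean absolute error) value of salient objects. In terms of F value, the optimized DRFI method gains the most, improving by 24.53% on the HKU-IS dataset. In terms of MAE, the CGVS method shows the smallest reduction, 12.78% on the ECSSD dataset, and the largest reduction is close to 50%. The model is also structurally simple, has few parameters, and is computationally efficient: training takes about 5 h and the average test time per image is about 3 s, which makes it highly practical. Conclusion This paper proposes a saliency result optimization algorithm. Experimental results show that it effectively improves the F value and MAE value of salient objects, and it shows good adaptability and application prospects in tasks such as object recognition that demand increasingly accurate salient object detection.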
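The F value and MAE reported above can be computed per image roughly as follows. This is a minimal sketch, not the authors' evaluation code. The weighting `beta2 = 0.3` and the adaptive threshold of twice the mean saliency are common conventions in salient object detection and are assumptions here, since the abstract does not state them. The function names `mae` and `f_measure` are illustrative.

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error between a saliency map and its ground truth (both in [0, 1])."""
    saliency = np.asarray(saliency, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean(np.abs(saliency - gt)))

def f_measure(saliency, gt, beta2=0.3, threshold=None):
    """Weighted F value of a binarized saliency map against a binary ground truth.

    beta2 = 0.3 and the default adaptive threshold are assumptions, not taken
    from the abstract.
    """
    saliency = np.asarray(saliency, dtype=np.float64)
    gt_mask = np.asarray(gt) > 0.5
    if threshold is None:
        # Adaptive threshold: twice the mean saliency, capped at 1.
        threshold = min(2.0 * saliency.mean(), 1.0)
    pred = saliency >= threshold
    tp = np.logical_and(pred, gt_mask).sum()
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / pred.sum()
    recall = tp / gt_mask.sum()
    return float((1.0 + beta2) * precision * recall / (beta2 * precision + recall))

# Toy example: a 2x2 map whose top row is the salient object.
gt = np.array([[1, 1], [0, 0]])
sal = np.array([[0.9, 0.8], [0.1, 0.2]])
print(mae(sal, gt), f_measure(sal, gt, threshold=0.5))
```

A lower MAE and a higher F value both indicate a saliency map closer to the ground truth, which is the direction of improvement the abstract reports.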
Keywords
Optimization of traditional saliency detection by sparse depth features
Hong Shizhan, Cao Tieyong, Fang Zheng, Xiang Shengkai (Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China)
Abstract
Objective Saliency detection, as a preprocessing component of computer vision, has received increasing attention in areas such as object relocation, scene classification, semantic segmentation, and visual tracking. Although object detection has advanced considerably, it remains challenging because of practical factors such as background complexity and attention mechanisms. Many salient object detection methods have been developed, and they fall mainly into traditional methods and newer methods based on deep learning. Traditional approaches locate salient objects through low-level handcrafted features such as contrast, color, and texture. These techniques have proven effective at preserving image structure and reducing computational effort. However, low-level features have difficulty capturing high-level semantic knowledge about objects and their surroundings, so methods based on them do not achieve good results when salient objects must be separated from cluttered backgrounds. Saliency detection methods based on deep learning locate salient objects mainly by automatically extracting high-level features. However, most of these models focus on nonlinear combinations of high-level features extracted from the final convolutional layer, and the boundaries of salient objects are often very blurry because low-level visual information such as edges is lacking. In these works, convolutional neural network (CNN) features are applied directly to the model without any processing. The features extracted from a CNN are generally high-dimensional and contain a large amount of noise, which reduces the efficiency with which they are used and can even be counterproductive. Sparse methods can effectively concentrate the salient objects in a feature map and eliminate some of the noise interference. Sparse self-encoding is one such method.
A traditional saliency recognition method based on sparse self-encoding and image fusion, combined with background prior, contrast analysis, and VGG (visual geometry group) saliency calculation, is proposed to solve these problems. Method The proposed algorithm consists mainly of four parts: traditional saliency map extraction, VGG feature extraction, sparse self-encoding, and saliency result optimization. First, the traditional method to be improved is selected, and its saliency map is computed. In this experiment, we select four traditional methods with excellent results, namely, discriminative regional feature integration (DRFI), high-dimensional color transform (HDCT), regularized random walks ranking (RRWR), and contour-guided visual search (CGVS). Then, the VGG network is used to extract feature maps. The feature maps obtained from each pooling layer are sparsely self-encoded to obtain 25 sparse saliency feature maps. In selecting among these feature maps, the features extracted by the first three pooling layers are mainly low-level features that retain excessive edge and texture information and duplicate the feature maps obtained by the traditional method; thus, these low-level feature maps are not used. A comparison of the fourth and fifth pooling layers shows that the fifth loses too much feature information, and experiments confirm that the fifth-layer feature maps act as interference. Thus, we use the feature maps extracted from the fourth pooling layer. These feature maps are passed through the sparse self-encoder to obtain five sparse feature maps. Each feature map is integrated with the corresponding saliency map obtained in the previous step. Finally, the neural network processes them and computes the final saliency map. Result Our experiments involved four open datasets: DUT-OMRON, ECSSD, HKU-IS, and MSRA.
We took half of the images from the four datasets to form a training set, and the remaining images formed four test sets, so the results are credible. The following conclusions are drawn from the experiments. 1) The proposed model greatly improves the F value of the four methods on the four datasets, including an increase of 24.53% for the DRFI method on the HKU-IS dataset. 2) The MAE (mean absolute error) value is also greatly reduced; the smallest reduction is 12.78%, for the CGVS method on the ECSSD dataset, and the largest is nearly 50%. 3) The proposed network has few layers, few parameters, and a short computation time. The training time is approximately 2 h, and the average test time per image is approximately 0.2 s. By contrast, the image saliency optimization scheme of Liu, which uses adaptive fusion, requires approximately 47 h of training and an average test time of 56.95 s per image. The proposed model thus greatly improves computational efficiency. 4) The proposed model achieves a significant improvement on the four datasets, especially HKU-IS and MSRA. These datasets contain difficult images, which confirms the effectiveness of the proposed method. Conclusion A method that combines low-level feature maps from traditional models, such as texture, with high-level feature maps from a sparsely self-encoded VGG network is proposed to optimize saliency results and greatly improve salient object recognition. The traditional methods DRFI, HDCT, RRWR, and CGVS are tested on the public salient object detection datasets DUT-OMRON, ECSSD, HKU-IS, and MSRA. The F value and MAE value are significantly improved, which confirms the effectiveness of the proposed method. Moreover, the method steps and network structure are simple and easy to understand, training takes little time, and the method can be readily adopted in practice.
The limitation of the study is that some of the extracted feature maps are left unused. In practice, only the feature maps from the fourth pooling layer of VGG are selected, so not all useful information is fully utilized.
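The sparse self-encoding step applied to the fourth-pooling-layer feature maps can be illustrated with a small sparse autoencoder. The sketch below rests on stated assumptions and is not the authors' implementation. It treats each spatial location as one sample whose vector holds the channel activations, uses five hidden units so that the encoder yields five maps, and enforces sparsity with a KL-divergence penalty. The random array stands in for rescaled VGG activations, and every name and hyperparameter (`train_sparse_autoencoder`, `rho`, `beta`, `lam`, `lr`) is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, n_hidden=5, rho=0.05, beta=3.0, lam=1e-4,
                             lr=0.1, n_iter=300, seed=0):
    """Train a one-hidden-layer autoencoder with a KL sparsity penalty.

    X: (n_samples, n_features) array with values in [0, 1].
    Returns the encoder weights, encoder bias, and the loss history.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(n_iter):
        H = sigmoid(X @ W1 + b1)                  # hidden (sparse) code
        R = sigmoid(H @ W2 + b2)                  # reconstruction of X
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)  # mean activation per unit
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        loss = (0.5 * np.sum((R - X) ** 2) / n
                + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
                + beta * kl)
        losses.append(float(loss))
        # Backpropagation: reconstruction error plus the sparsity term.
        dR = (R - X) * R * (1 - R)
        sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        dH = (dR @ W2.T + sparse_grad) * H * (1 - H)
        W2 -= lr * (H.T @ dR / n + lam * W2); b2 -= lr * dR.mean(axis=0)
        W1 -= lr * (X.T @ dH / n + lam * W1); b1 -= lr * dH.mean(axis=0)
    return W1, b1, losses

def encode_feature_maps(features, W1, b1):
    """Map (C, H, W) features to (n_hidden, H, W) sparse maps."""
    c, h, w = features.shape
    X = features.reshape(c, h * w).T              # one row per spatial location
    return sigmoid(X @ W1 + b1).T.reshape(-1, h, w)

# Stand-in for rescaled pool4 activations (16 channels, 14x14), not real VGG output.
rng = np.random.default_rng(1)
features = rng.random((16, 14, 14))
X = features.reshape(16, -1).T
W1, b1, losses = train_sparse_autoencoder(X)
sparse_maps = encode_feature_maps(features, W1, b1)
print(sparse_maps.shape, losses[0], losses[-1])
```

In a full pipeline along the lines described in the abstract, the five resulting maps would be stacked with the traditional saliency map and passed to the optimization CNN. That fusion network is not sketched here because the abstract does not give its architecture.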
Keywords
saliency detection; visual geometry group (VGG); sparse self-encoding; image fusion; convolutional neural network (CNN)