Multi-path collaborative salient object detection based on RGB-T images

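Jiang Tingting1, Liu Yu1, Ma Xin2, Sun Jinglin1 (1. School of Microelectronics, Tianjin University, Tianjin 300072; 2. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072)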

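Abstract
Objective Salient object detection is fundamental to machine vision applications, yet many current methods perform poorly in complex scenes, such as when the salient object resembles the background or the illumination is low. To improve saliency detection performance, a multi-path collaborative salient object detection method for RGB-T (thermal) images is proposed. Method The main body of the model is designed as two backbone networks and three decoding branches. The backbone networks extract the feature representations of the RGB and thermal images, and the decoding branches predict the salient objects from the RGB features, the thermal features, and the fusion of both, respectively, in a collaborative and complementary manner. In the feature-extraction backbones, a feature enhancement module fuses the complementary multi-modal images, and a suitably modified pyramid pooling module captures global semantic information from deep-level features. During decoding, a channel attention mechanism further distinguishes the semantic differences between channels of the features generated by the convolutional neural network (CNN). Result On the VT821 and VT1000 datasets, the maximum F-measure of the proposed method is 0.843 7 and 0.880 5, respectively, and the mean absolute error (MAE) is 0.039 4 and 0.032 2, respectively, which improves overall detection performance relative to the compared methods. Conclusion Comparative experiments show that the proposed method improves the stability of saliency detection and achieves better results in some low-light scenes.
Keywords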
Multi-path collaborative salient object detection based on RGB-T images

Jiang Tingting1, Liu Yu1, Ma Xin2, Sun Jinglin1(1.School of Microelectronics, Tianjin University, Tianjin 300072, China;2.School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China)

Abstract
Objective Saliency detection is a fundamental technology in computer vision and image processing that aims to identify the most visually distinctive objects or regions in an image. As a preprocessing step, salient object detection plays a critical role in many computer vision applications, including visual tracking, scene classification, image retrieval, and content-based image compression. Numerous salient object detection methods have been presented, but most are designed for RGB images alone or for RGB-depth (RGB-D) images, and they still struggle in some complex scenarios. RGB methods may fail to distinguish salient objects from the background when the foreground and background look similar or the contrast is low. RGB-D methods also degrade under low-light conditions and variations in illumination. Because thermal infrared images are invariant to illumination conditions, we propose a multi-path collaborative salient object detection method that improves saliency detection by exploiting the multi-modal feature information of RGB and thermal images. Method We design a novel end-to-end deep neural network for RGB-thermal (RGB-T) salient object detection. It consists of an encoder network and a decoder network and includes a feature enhancement module, a pyramid pooling module, a channel attention module, and an l1-norm fusion strategy. First, the main body of the model contains two backbone networks that extract the feature representations of the RGB and thermal images, respectively. Then, three decoding branches predict the saliency maps in a coordinated and complementary manner from the extracted RGB features, the thermal features, and the fusion of both, respectively. The two backbone streams share the same structure, which is based on the Visual Geometry Group 19-layer (VGG-19) network.
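As an illustration of the l1-norm fusion strategy named in the Method part, the following minimal sketch weights two feature tensors by their per-pixel l1-norm across channels. The abstract does not give the exact form of the strategy, so this weighting scheme, the function names `l1_activity` and `l1_norm_fuse`, and the nested-list tensor layout `feat[channel][row][col]` are all assumptions made for illustration only.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # assumed layout: feat[channel][row][col]


def l1_activity(feat: Tensor3D) -> List[List[float]]:
    """Per-pixel l1-norm of the feature vector taken across channels."""
    height, width = len(feat[0]), len(feat[0][0])
    return [
        [sum(abs(channel[y][x]) for channel in feat) for x in range(width)]
        for y in range(height)
    ]


def l1_norm_fuse(feat_rgb: Tensor3D, feat_t: Tensor3D) -> Tensor3D:
    """Fuse two same-shaped feature tensors with l1-norm-based pixel weights.

    Assumed scheme: at each pixel, the modality with the larger l1-norm
    (stronger activation) receives the larger weight; weights sum to 1.
    """
    act_rgb = l1_activity(feat_rgb)
    act_t = l1_activity(feat_t)
    fused: Tensor3D = []
    for ch_rgb, ch_t in zip(feat_rgb, feat_t):
        fused_channel = []
        for y, (row_rgb, row_t) in enumerate(zip(ch_rgb, ch_t)):
            fused_row = []
            for x, (v_rgb, v_t) in enumerate(zip(row_rgb, row_t)):
                total = act_rgb[y][x] + act_t[y][x]
                # Fall back to equal weights where both activations are zero.
                w_rgb = act_rgb[y][x] / total if total > 0 else 0.5
                fused_row.append(w_rgb * v_rgb + (1.0 - w_rgb) * v_t)
            fused_channel.append(fused_row)
        fused.append(fused_channel)
    return fused
```

Under this assumed scheme, a pixel where the thermal response is three times stronger than the RGB response takes three quarters of its fused value from the thermal stream, which matches the intuition that the thermal image should dominate in poorly lit regions.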
To better fit the saliency detection task, we keep only the five convolutional blocks of VGG-19 and discard the last pooling layer and the fully connected layers, which preserves more spatial information from the input image. Second, the feature enhancement module fully extracts and fuses multi-modal complementary cues from the RGB and thermal streams. The modified pyramid pooling module captures global semantic information from deep-level features, which is used to locate salient objects. Finally, in the decoding process, the channel attention mechanism distinguishes the semantic differences between channels, thereby improving the decoder's ability to separate salient objects from backgrounds. The entire model is trained in an end-to-end manner. Our training set consists of 900 aligned RGB-T image pairs randomly selected from each subset of the VT1000 dataset. To prevent overfitting, we augment the training set by flipping and rotation. Our method is implemented with the PyTorch toolbox and trained on a PC with a GTX 1080Ti GPU and 11 GB of memory. The input images are uniformly resized to 256×256 pixels. The momentum, weight decay, and learning rate are set to 0.9, 0.000 5, and 1E-9, respectively. During training, the softmax entropy loss is used to train the entire network to convergence. Result We compare our model with four state-of-the-art saliency models, two RGB-based and two RGB-D-based, on two public datasets, VT821 and VT1000. The quantitative evaluation metrics are the F-measure, the mean absolute error (MAE), and precision-recall (PR) curves, and we also provide several saliency maps from each method for visual comparison. The experimental results demonstrate that our model outperforms the other methods and that its saliency maps have more refined shapes under challenging conditions, such as poor illumination and low contrast.
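The flipping and rotation used to augment the training set can be sketched as follows. The abstract does not state which flips or angles are used, so the horizontal flip and the 90-degree rotation steps here are assumptions, as are the names `hflip`, `rot90`, and `augment`. The one point the sketch is meant to show is that the RGB image, the thermal image, and the ground-truth mask of an aligned sample must receive the same transform.

```python
from typing import Any, List, Tuple

Grid = List[List[Any]]  # one image plane as rows of pixel values


def hflip(grid: Grid) -> Grid:
    """Mirror the image left to right."""
    return [row[::-1] for row in grid]


def rot90(grid: Grid) -> Grid:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def augment(sample: Tuple[Grid, ...]) -> List[Tuple[Grid, ...]]:
    """Return the original sample plus one flipped and three rotated copies.

    `sample` is assumed to be (rgb, thermal, mask); every transform is
    applied to all members so the aligned pair and its label stay aligned.
    """
    variants = [sample, tuple(hflip(g) for g in sample)]
    rotated = sample
    for _ in range(3):  # 90, 180, and 270 degrees
        rotated = tuple(rot90(g) for g in rotated)
        variants.append(rotated)
    return variants
```

With these assumed transforms, each aligned sample yields five training samples.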
Compared with the other four methods on VT821, our method obtains the best maximum F-measure and MAE. The maximum F-measure (higher is better) is 0.26% higher, and the MAE (lower is better) is 0.17% lower, than those of the second-ranked method. On VT1000, our model also achieves the best maximum F-measure, which reaches 88.05% and is 0.46% higher than that of the second-ranked method. However, its MAE is 3.22%, which is 0.09% higher than, and thus slightly worse than, that of the first-ranked method. Conclusion We propose a CNN-based method for RGB-T salient object detection. To the best of our knowledge, existing saliency detection methods are mostly based on RGB or RGB-D images, so exploring CNNs for RGB-T salient object detection is worthwhile. The experimental results on two public RGB-T datasets demonstrate that the proposed method performs better than the state-of-the-art methods, especially in challenging scenes with poor illumination, complex backgrounds, or low contrast, which shows that fusing multi-modal information from RGB and thermal images effectively improves performance. However, public datasets for RGB-T salient object detection, which are very important for the performance of deep learning networks, are lacking. In addition, detection speed is a key measure when saliency detection serves as a preprocessing step for other computer vision tasks. Thus, in future work, we will collect more high-quality datasets for RGB-T salient object detection and design more lightweight models to increase detection speed.
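The two scalar metrics reported in the Result part, maximum F-measure and MAE, can be computed along the following lines. This is a minimal per-image sketch, not the evaluation code behind the reported numbers: the weight `beta2 = 0.3` is the value commonly used in the saliency literature and is assumed here, as are the 256 evenly spaced thresholds and the function names. Benchmark protocols typically average precision and recall over the whole dataset before taking the maximum, which this sketch omits.

```python
from typing import Sequence

Map = Sequence[Sequence[float]]  # 2-D map with values in [0, 1]


def mae(pred: Map, gt: Map) -> float:
    """Mean absolute error between a saliency map and its ground truth."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def f_measure(pred: Map, gt: Map, threshold: float, beta2: float = 0.3) -> float:
    """F-measure of the map binarized at `threshold` (beta2 is assumed)."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            predicted, actual = p >= threshold, g >= 0.5
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def max_f_measure(pred: Map, gt: Map, steps: int = 255) -> float:
    """Largest F-measure over evenly spaced thresholds in [0, 1]."""
    return max(f_measure(pred, gt, t / steps) for t in range(steps + 1))
```

A higher maximum F-measure and a lower MAE both indicate a saliency map closer to the ground truth, which is the sense in which the comparisons above are ranked.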
Keywords
