Dual optical flow network-guided video object detection

Yu Wanqing, Yu Jing, Shi Xinqi, Xiao Chuangbai (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)

Abstract
Objective Convolutional neural networks are widely used in object detection. The task of video object detection is to classify and localize moving objects in sequential images. Most existing video object detection methods build on a still-image object detector and exploit the temporal correlation peculiar to video to reduce the missed and false detections caused by occlusion and motion blur of moving objects. Method This paper proposes a video object detection model guided by dual optical flow networks. Under the two-stage object detection framework, two different optical flow networks estimate the optical flow fields of adjacent frames at different temporal distances for multi-frame feature fusion: for adjacent frames close to the current frame, an optical flow network designed for small-displacement motion estimation is used, and for adjacent frames farther away, an optical flow network designed for large-displacement motion estimation is used. Guided by the optical flow, the features of multiple adjacent frames are fused to compensate for the feature of the current frame. Result Experimental results show that the mAP (mean average precision) of the proposed model is 76.4%, which is 28.9%, 8.0%, 0.6%, and 0.2% higher than those of the TCN (temporal convolutional networks) model, the TPN+LSTM (tubelet proposal network and long short-term memory network) model, the D(&T loss) model, and the FGFA (flow-guided feature aggregation) model, respectively. Conclusion By exploiting the temporal correlation peculiar to video, the proposed model uses dual optical flow networks to accurately compensate for the feature of the current frame from adjacent frames, improves the accuracy of video object detection, and effectively alleviates missed and false detections in video object detection.
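The per-distance selection of the two optical flow networks described in the Method can be sketched as follows. This is a minimal illustration, not the paper's implementation; in particular, the frame-distance cutoff `threshold` is an assumed hyperparameter, since no specific value is given here.

```python
def select_flow_network(frame_offset, small_disp_net, large_disp_net,
                        threshold=5):
    """Pick the optical flow network by temporal distance to the current frame.

    Adjacent frames within `threshold` frames of the current frame use the
    small-displacement flow network; farther frames use the
    large-displacement flow network. `threshold` is an assumed value
    for illustration only.
    """
    if abs(frame_offset) <= threshold:
        return small_disp_net
    return large_disp_net

# Toy stand-ins for the two flow networks.
small_net = object()
large_net = object()
assert select_flow_network(2, small_net, large_net) is small_net   # near frame
assert select_flow_network(9, small_net, large_net) is large_net   # far frame
```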
Keywords
Dual optical flow network-guided video object detection

Yu Wanqing, Yu Jing, Shi Xinqi, Xiao Chuangbai(Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)

Abstract
Objective Object detection is a fundamental task in computer vision applications, and it provides support for subsequent object tracking, instance segmentation, and behavior recognition. The rapid development of deep learning has facilitated the wide use of convolutional neural network in object detection and shifted object detection from the traditional object detection method to the recent object detection method based on deep learning. Still image object detection has considerably progressed in recent years. It aims to determine the category and the position of each object in an image. The task of video object detection is to locate moving object in sequential images and assign the category label to each object. The accuracy of video object detection suffers from degenerated object appearances in videos, such as motion blur, multi-object occlusion, and rare poses. The methods of still image object detection have achieved excellent results. However, directly applying them to video object detection is challenging because still-image detectors may generate false negatives and positives caused by motion blur and object occlusion. Most existing video object detection methods incorporate temporal consistency across frames to improve upon single-frame detections. Method We propose a video object detection method guided by dual optical flow networks, which precisely propagate the features from adjacent frames to the feature of the current frame and enhance the feature of the current frame by fusing the features of the adjacent frames. Under the framework of two-stage object detection, the deep convolutional network model is used for the feature extraction to produce the feature in each frame of the video. According to the optical flow field, the features of the adjacent frames are used to compensate the feature of the current frame. 
According to the time interval between the adjacent frames and the current frame, two different optical flow networks are applied to estimate the optical flow fields. Specifically, an optical flow network designed for small-displacement motion estimation handles the closer adjacent frames, whereas an optical flow network designed for large-displacement motion estimation handles the farther adjacent frames. The compensated feature maps of multiple frames, together with the feature map of the current frame, are aggregated according to adaptive weights. The adaptive weights indicate the importance of each compensated feature map to the current frame. The similarity between a compensated feature map and the feature map extracted from the current frame is measured with the cosine similarity metric: a compensated feature map close to the feature map of the current frame is assigned a large weight; otherwise, it is assigned a small weight. An embedding network consisting of three convolutional layers is applied to the compensated feature maps and the current feature map to produce embedding feature maps, from which the adaptive weights are computed. Result Experimental results show that the mean average precision (mAP) of the proposed method on the ImageNet VID (video object detection) dataset reaches 76.42%, which is 28.92%, 8.02%, 0.62%, and 0.24% higher than those of the temporal convolutional network (TCN), the method combining a tubelet proposal network (TPN) with a long short-term memory (LSTM) network, the D(&T loss) method, and flow-guided feature aggregation (FGFA), respectively. We also report the mAP scores over slow, medium, and fast objects.
Compared with FGFA, our method combining the two optical flow networks improves the mAP scores on slow, medium, and fast objects by 0.2%, 0.48%, and 0.23%, respectively. The dual optical flow networks improve the estimation of the optical flow fields between the adjacent frames and the current frame, so the feature of the current frame can be compensated more precisely from the adjacent frames. Conclusion Exploiting the temporal correlation peculiar to video, the proposed model improves the accuracy of video object detection through feature aggregation guided by dual optical flow networks under the framework of two-stage object detection. The dual optical flow networks accurately compensate for the feature of the current frame from the adjacent frames. Accordingly, the feature of each adjacent frame is fully utilized, and false negatives and false positives are reduced through temporal feature fusion in video object detection.
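The adaptive weighting described in the Method (cosine similarity between embedding feature maps, with similar compensated features receiving larger weights) can be sketched as follows. This is a plain-Python sketch under assumptions: the embeddings are flattened vectors standing in for the outputs of the three-layer embedding network, and the softmax normalization over frames is one common choice rather than a detail confirmed by this abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def adaptive_weights(current_emb, compensated_embs):
    """Per-frame aggregation weights from cosine similarity.

    A compensated feature map whose embedding is closer to the current
    frame's embedding gets a larger weight; the weights are normalized
    to sum to 1 with a numerically stable softmax.
    """
    sims = [cosine(current_emb, e) for e in compensated_embs]
    m = max(sims)                        # subtract max for stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# The embedding identical to the current frame's gets the larger weight.
w = adaptive_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
assert w[0] > w[1] and abs(sum(w) - 1.0) < 1e-9
```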
Keywords
