A multitask framework for real-time visual object tracking and video object segmentation
Abstract
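Objective Researchers have proposed several multitask frameworks for visual object tracking (VOT) and video object segmentation (VOS), but the accuracy and robustness of such frameworks are poor. To address this problem, this paper proposes an end-to-end multitask framework for real-time VOT and VOS that fuses multi-scale context information with video interframe information. Method The proposed architecture uses an atrous spatial pyramid pooling module that is built from atrous depthwise separable convolutions and covers more scales, together with an interframe mask propagation module that carries interframe information. These modules give the network a stronger ability to segment multi-scale target objects and make it more robust. Result On the VOT-2016 and VOT-2018 datasets, the expected average overlap (EAO) of the proposed method reaches 0.462 and 0.408, respectively, which is 0.029 and 0.028 higher than SiamMask. These are state-of-the-art results, and the method also shows better robustness. Competitive results are also obtained on the DAVIS (densely annotated video segmentation)-2016 and DAVIS-2017 datasets for VOS. On the multi-object DAVIS-2017 dataset, the proposed method outperforms SiamMask: the mean Jaccard index of region similarity (JM) and the mean F-measure of contour accuracy (FM) reach 56.0 and 59.0, respectively, and the region and contour decay values (JD and FD) are 17.9 and 19.8, respectively, both lower than those of SiamMask. The method runs at 45 frames/s, which is real-time. Conclusion The proposed end-to-end multitask framework for real-time VOT and VOS fuses multi-scale context information with video interframe information. It fully captures multi-scale context and exploits the information between video frames, so the network segments multi-scale target objects more effectively and is more robust.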
Keywords
Multitask framework for video object tracking and segmentation combined with multi-scale interframe information
Li Han1, Liu Kunhua1, Liu Jiajie1, Zhang Xiaoye2 (1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China; 2. Guangdong Diankeyuan Energy Technology Co., Ltd, Guangzhou 510080, China)
Abstract
Objective Visual object tracking (VOT) is widely used in applications such as car navigation, automatic video surveillance, and human-computer interaction. It is a basic research task in video applications and requires inferring the correspondence between the target and each frame: given the position of an arbitrary object of interest in the first frame of a video, its position must be estimated in all subsequent frames as accurately as possible. Similarly, semi-supervised video object segmentation (VOS) requires segmenting the target object in the subsequent frames of a video given its mask in the initial frame. It is also a basic research task in computer vision. However, the target object may undergo large changes in pose, scale, and appearance over the video sequence, and it may be affected by adverse conditions such as occlusion, rapid movement, and truncation. Therefore, performing robust VOT and VOS in a semi-supervised manner is challenging. The continuous nature of a video sequence brings additional contextual information to VOS. The interframe consistency of video enables the network to transfer information effectively from frame to frame. In VOS, the information from previous frames can be regarded as temporal context and can provide useful hints for subsequent predictions. Therefore, the effective use of the additional information that video provides is extremely important for video tasks. Various multitask frameworks have been proposed for VOT and VOS, but the accuracy and robustness of such frameworks are poor. To address these problems, this paper proposes an end-to-end multitask framework for real-time VOT and VOS that combines multi-scale context information with video interframe information. Method In this work, the depthwise convolution in the depthwise separable convolution is replaced with an atrous depthwise convolution, thereby forming the atrous depthwise separable convolution. 
Depending on the atrous rate, this convolution can have different receptive fields while remaining lightweight. This study designs an atrous spatial pyramid pooling module that is composed of atrous depthwise separable convolutions with many atrous rates, and applies it to the VOS branch so that the network can capture multi-scale context. The module convolves the feature map with atrous rates of 1, 3, 6, 9, 12, 24, 36, and 48, which give different receptive fields, and also applies adaptive pooling to the feature map. The resulting feature maps are concatenated, and a 1×1 convolution is used to transform the number of channels. Through these operations, the feature map output by the module carries rich multi-scale context information, which enables the network to predict targets at multiple scales. Continuity is a unique property of video sequences and brings additional contextual information to video tasks. The interframe consistency of video enables the network to transfer information effectively from frame to frame. In VOS, the information from previous frames can be regarded as temporal context and can provide useful hints for subsequent predictions. Therefore, the effective use of the additional information that video provides is extremely important for video tasks. Inspired by the reference-guided mask propagation algorithm, a mask propagation module is added to the VOS branch to provide location and segmentation information to the network. The proposed mask propagation module is composed of 3×3 convolutions with atrous rates of 2, 3, and 6. Together, the multi-scale atrous spatial pyramid pooling module and the interframe mask propagation module give the network a strong ability to segment multi-scale target objects and make it more robust. 
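To make the receptive-field and lightweight claims concrete, the following minimal Python sketch computes the spatial extent of a dilated kernel at each atrous rate listed above and compares the weight count of a standard convolution with that of a depthwise separable one. The 3×3 kernel size for the pyramid branches and the 256-channel width are assumptions chosen for illustration, since the abstract does not state them, and the function names are hypothetical.

```python
# Illustrative sketch only: kernel size and channel width are assumed,
# not taken from the paper.

ASPP_RATES = [1, 3, 6, 9, 12, 24, 36, 48]  # atrous rates listed in the abstract


def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a kernel whose taps are spaced `rate` apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def standard_conv_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one.

    Dilation spreads the taps apart but adds no weights, so the count is the
    same for every atrous rate.
    """
    depthwise = kernel_size * kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    for rate in ASPP_RATES:
        print(f"rate {rate:2d}: effective extent {effective_kernel_size(3, rate)}")
    print("standard :", standard_conv_params(256, 256, 3))
    print("separable:", separable_conv_params(256, 256, 3))
```

Under these assumptions the extent of a 3×3 kernel grows from 3 at rate 1 to 97 at rate 48, while the separable layer needs 67,840 weights against 589,824 for a standard one at every rate.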
Result All experiments in this work are performed on NVIDIA TITAN X graphics cards. The network is trained in two stages, and the stages use different training sets because the datasets differ in nature. The first stage uses the YouTube-VOS, common objects in context (COCO), ImageNet-DET (detection), and ImageNet-VID (video) datasets. For the datasets without mask ground truth, the mask branch is not trained. For a video sequence with only a single frame, the image and mask of the previous frame in the interframe mask propagation module are set to be the same as those of the current frame. Following SiamMask, this work uses the stochastic gradient descent optimizer and a warm-up training strategy. The learning rate increases from 1×10⁻³ to 5×10⁻³ over the first 5 epochs, and a logarithmic decay strategy then reduces it to 2.5×10⁻⁴ over 15 epochs. The second stage uses only the YouTube-VOS and COCO datasets. Both datasets provide mask ground truth, which helps improve the segmentation of video objects. The second stage uses a logarithmic decay strategy to reduce the learning rate from 2.5×10⁻⁴ to 1.0×10⁻⁴ over 20 epochs. The expected average overlap (EAO) of the proposed method on the VOT-2016 and VOT-2018 datasets reaches 0.462 and 0.408, respectively, which is approximately 0.03 higher than SiamMask in both cases. The proposed method achieves state-of-the-art results and shows better robustness. Competitive results are also achieved on the DAVIS-2016 and DAVIS-2017 datasets for VOS. On the multi-object DAVIS-2017 dataset, the proposed method performs better than SiamMask. The evaluation indexes JM and FM reach 56.0 and 59.0, respectively, and the decay values of the region and the contour, JD and FD, are 17.9 and 19.8, respectively, which are lower than those of SiamMask. 
The method runs at 45 frames per second, which is real-time. Conclusion In this study, we propose an end-to-end multitask framework for real-time VOT and VOS. The proposed method integrates multi-scale context information with video interframe information, so it fully captures multi-scale context and exploits the information between video frames. As a result, the network segments multi-scale target objects more effectively and is more robust.
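The two-stage learning-rate schedule described in the Result paragraph can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the abstract gives only the endpoint values and epoch counts, so the linear shape of the warm-up, the reading of "logarithmic decay" as even spacing in log space, and the per-epoch update granularity are all assumptions, and the function names are hypothetical.

```python
import math


def warmup_lr(epoch: int, start: float = 1e-3, end: float = 5e-3, epochs: int = 5) -> float:
    """Linear warm-up (assumed shape): `start` at epoch 0, `end` at the last warm-up epoch."""
    return start + (end - start) * epoch / (epochs - 1)


def log_decay_lr(epoch: int, start: float, end: float, epochs: int) -> float:
    """Decay evenly spaced in log space (assumed meaning of 'logarithmic decay')."""
    t = epoch / (epochs - 1)
    return math.exp(math.log(start) + t * (math.log(end) - math.log(start)))


def stage1_lr(epoch: int) -> float:
    """Stage 1: 5 warm-up epochs, then 15 decay epochs from 5e-3 down to 2.5e-4."""
    if epoch < 5:
        return warmup_lr(epoch)
    return log_decay_lr(epoch - 5, 5e-3, 2.5e-4, 15)


def stage2_lr(epoch: int) -> float:
    """Stage 2: 20 decay epochs from 2.5e-4 down to 1.0e-4."""
    return log_decay_lr(epoch, 2.5e-4, 1.0e-4, 20)


if __name__ == "__main__":
    for e in range(20):
        print(f"stage 1, epoch {e:2d}: lr = {stage1_lr(e):.6f}")
    for e in range(20):
        print(f"stage 2, epoch {e:2d}: lr = {stage2_lr(e):.6f}")
```

With this reading, stage 1 spans 20 epochs in total and ends at the learning rate from which stage 2 starts, so the two stages join without a jump.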
Keywords
visual object tracking (VOT); video object segmentation (VOS); fully convolutional network (FCN); atrous spatial pyramid pooling; inter-frame mask propagation