Video instance segmentation based on temporal feature fusion

Huang Zetao1, Liu Yang1, Yu Chenglong2, Zhang Jiajia1, Wang Xuan1,3, Qi Shuhan1,3 (1. Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China; 2. Shenzhen Institute of Information Technology, Shenzhen 518172, China; 3. Peng Cheng Laboratory, Shenzhen 518052, China)

Abstract
Objective With the rapid development of the mobile internet and artificial intelligence, massive volumes of video data are continually produced, and how to process and analyze these data is a challenging problem for researchers. Because of shooting angles, fast motion, and partial occlusion, objects in videos often appear blurred and varied, and the image quality falls well short of that of ordinary image datasets, which makes instance segmentation on video data difficult. Most existing video instance segmentation frameworks rely on image detection methods that process single frames directly and then link the detections by association matching into mask sequences of the same object; they lack specific handling of difficult video scenes and ignore the temporal information in videos. Method This paper designs a multi-task learning video instance segmentation model based on temporal feature fusion. To address the relatively poor quality of ordinary video frames, the model combines a feature pyramid network with a scaled dot-product attention mechanism: object features detected in other frames are weighted and aggregated onto the current frame's features along the temporal dimension, strengthening the feature response of candidate objects and suppressing background information; multi-scale features are then fused to enrich the spatial semantic information of the image. In addition, a point prediction network is added to the segmentation module to improve segmentation accuracy, and multi-task learning enables end-to-end simultaneous detection, segmentation, and association-based tracking of objects in video. Result Experiments on the YouTube-VIS validation set show that, compared with existing methods, the proposed method improves mean average precision on the video instance segmentation task by about 2%. The comparative experiments demonstrate that the proposed temporal feature fusion module improves video segmentation results. Conclusion To address the facts that current video instance segmentation work ignores temporal context information and lacks handling of difficult video scenes, this paper proposes a multi-task learning video instance segmentation model that fuses temporal features, improving the segmentation of objects in video.
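The temporal fusion described in the Method above, weighting and aggregating object features detected in other frames onto the current frame's features with scaled dot-product attention, can be illustrated with a minimal PyTorch-style sketch. The tensor shapes, the residual addition, and the function name temporal_feature_fusion are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def temporal_feature_fusion(curr_feats, aux_feats, d_model=256):
    """Hypothetical sketch: aggregate auxiliary-frame object features onto
    current-frame features with scaled dot-product attention.

    curr_feats: (N, d_model) features of candidate objects in the current frame
    aux_feats:  (M, d_model) features of objects detected in other (auxiliary) frames
    returns:    (N, d_model) current features enhanced by temporal aggregation
    """
    # Queries come from the current frame; keys/values from auxiliary frames.
    q, k, v = curr_feats, aux_feats, aux_feats

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    scores = q @ k.t() / (d_model ** 0.5)   # (N, M) similarity of each current
                                            # object to each auxiliary object
    weights = F.softmax(scores, dim=-1)     # normalize over auxiliary objects
    aggregated = weights @ v                # (N, d) weighted sum of auxiliary features

    # Residual fusion: strengthen the current responses with temporal context.
    return curr_feats + aggregated
```

In the paper's setting, the queries would correspond to candidate-object features of the current frame and the keys/values to candidates selected from auxiliary frames; the residual addition reflects the stated goal of strengthening candidate responses while suppressing background.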
Keywords
Video instance segmentation based on temporal feature fusion

Huang Zetao1, Liu Yang1, Yu Chenglong2, Zhang Jiajia1, Wang Xuan1,3, Qi Shuhan1,3(1.Harbin Institute of Technology, Shenzhen 518055, China;2.Shenzhen Institute of Information Technology, Shenzhen 518172, China;3.Peng Cheng Laboratory, Shenzhen 518052, China)

Abstract
Objective With the rapid development of the mobile internet and artificial intelligence, a growing number of video applications are occupying people's daily life, and large volumes of video data are generated every day. Beyond its sheer volume and high storage cost, video content is itself complex, often containing many characters, actions, and scenes, so video understanding is more challenging and more urgent than common image understanding. How to process and analyze these video data is therefore a challenging problem for researchers. Owing to shooting angles and fast motion, objects in video often appear blurred and varied, and a wide gap exists between the quality of common image datasets and that of video datasets. Video instance segmentation extends instance segmentation to the video domain: it involves detecting, segmenting, and tracking object instances. The task must not only assign the pixels of each frame to the corresponding semantic categories and object instances but also associate those instances across the entire video sequence. Video defocus, motion blur, and partial occlusion make video instance segmentation difficult and lead to poor performance. Existing video instance segmentation algorithms mainly apply image instance segmentation algorithms to predict the target masks in every frame and then use tracking algorithms to associate the per-frame detections into mask sequences along the video. However, these algorithms rely on the initial image detection performance and ignore temporal context information, so information is not effectively transmitted and exchanged between frames, and classification and segmentation performance is not ideal in difficult video scenes. Method To solve this problem, this study designs a multi-task learning video instance segmentation model based on temporal feature fusion. We combine a feature pyramid network with a scaled dot-product attention operation in the temporal domain. The feature pyramid network is a feature extractor designed around the feature pyramid concept to improve both accuracy and speed; it replaces the single-scale feature extractor in detectors such as the faster region-based convolutional neural network (Faster R-CNN) and produces a higher-quality pyramid of feature maps. It fuses features along two pathways: a bottom-up pathway and a top-down pathway. The bottom-up pathway is the forward pass of the backbone, whereas the top-down pathway upsamples the higher-level features and adds them element-wise to the corresponding features of the lower level. Scaled dot-product attention is the basic component of the multi-head attention module in the Transformer, a popular encoder-decoder attention architecture in machine translation. With the temporal feature fusion module, the object features detected in other frames are weighted and aggregated onto the current frame's features to strengthen the feature response of candidate objects and suppress background information. The spatial semantic information of the image is then enriched by fusing multi-scale features of the current frame.
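The top-down fusion step described above, upsampling a coarser feature map and adding it element-wise to the laterally projected finer map, can be sketched as follows. The channel widths, the nearest-neighbor upsampling, and the class name TopDownFusion are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Illustrative FPN-style top-down step: upsample the coarser map and add it
    element-wise to a 1x1-projected (lateral) version of the finer backbone map."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # Lateral 1x1 conv aligns backbone feature channels with the pyramid width.
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # 3x3 conv smooths the merged map (standard FPN practice).
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, bottom_up_feat, top_down_feat):
        # bottom_up_feat: finer-resolution feature from the bottom-up pathway
        # top_down_feat:  coarser, semantically stronger feature from the level above
        lateral = self.lateral(bottom_up_feat)
        upsampled = F.interpolate(top_down_feat, size=lateral.shape[-2:], mode="nearest")
        merged = lateral + upsampled    # element-wise addition described above
        return self.smooth(merged)
```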
In this way, the model can capture the fine-grained correlations between other frames and the current frame, and selectively aggregate important features from other frames to enhance the representation of the current features. At the same time, a point prediction network added to the segmentation module improves segmentation precision compared with a plain fully convolutional segmentation network. The objects are then detected, segmented, and tracked simultaneously in the video by our end-to-end multi-task learning video instance segmentation framework. Result Experiments on the YouTube-VIS dataset show that our method improves the mean average precision of video instance segmentation by nearly 2% compared with current methods. We also conduct a series of ablation experiments. On the one hand, we plug different segmentation modules into the model and compare the effect of a fully convolutional network head and a point prediction segmentation head on the two-stage video instance segmentation model. On the other hand, because the temporal feature fusion module selects region proposal network (RPN) candidate objects from the auxiliary frame for information fusion during training, different numbers of RPN objects need to be compared experimentally. The best result, 32.7% AP, is obtained with 10 RPN objects. These results show that the proposed temporal feature fusion module improves video segmentation. Conclusion In this study, a two-stage video instance segmentation model with a temporal feature fusion module is proposed. In the first stage, the ResNet backbone extracts features from an input image, the temporal feature fusion module further extracts features at multiple scales through the feature pyramid network and aggregates the object features detected in other frames to enhance the feature response of the current frame, and the region proposal network then extracts multiple candidate objects from the image. In the second stage, the features of the proposal objects are fed into three parallel network heads to obtain the corresponding results: the detection head predicts the object classes and positions in the current image, the segmentation head predicts the instance segmentation masks of the current image, and the association head links objects over time by matching each detection to the most similar instance in the instance storage space. In summary, our video instance segmentation model applies the feature pyramid network and scaled dot-product attention to temporal feature fusion in video, which improves the accuracy of video segmentation.
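The association head just described, matching each newly detected instance against an instance storage space and reusing the identity of the most similar stored instance, can be sketched with the minimal structure below. The cosine-similarity metric, the matching threshold, and the names InstanceMemory and associate are hypothetical choices; the abstract only states that association is performed by matching the most similar instance in the storage space.

```python
import torch
import torch.nn.functional as F

class InstanceMemory:
    """Illustrative instance storage space: keeps one embedding per tracked identity
    and assigns each new detection to the most similar stored identity."""

    def __init__(self, sim_threshold=0.5):
        self.entries = []          # list of (identity, embedding) pairs
        self.next_id = 0
        self.sim_threshold = sim_threshold

    def associate(self, det_embedding):
        # det_embedding: (d,) embedding of one detected instance in the current frame
        if self.entries:
            ids, feats = zip(*self.entries)
            feats = torch.stack(feats)                                      # (K, d)
            sims = F.cosine_similarity(det_embedding.unsqueeze(0), feats)   # (K,)
            best = int(torch.argmax(sims))
            if sims[best] > self.sim_threshold:
                # Reuse the identity of the most similar stored instance
                # and refresh its embedding with the latest detection.
                self.entries[best] = (ids[best], det_embedding)
                return ids[best]
        # No sufficiently similar instance: start a new identity.
        new_id = self.next_id
        self.next_id += 1
        self.entries.append((new_id, det_embedding))
        return new_id
```

Calling associate once per detection in each frame yields a consistent identity sequence over the video, which is the tracking behavior the association head is meant to provide.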
Keywords
