Multi-object tracking and segmentation with a spatiotemporal feature fusion network

Liu Yuting, Zhang Kaihua, Fan Jiaqing, Liu Qingshan (Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China)

Abstract
Objective Multi-object tracking and segmentation is an important research direction in computer vision. Most existing methods borrow the detect-then-track-and-segment idea from multi-object tracking; they pay insufficient attention to important feature information and therefore struggle with problems such as object occlusion. To address these issues, this paper proposes a multi-object tracking and segmentation model based on spatiotemporal feature fusion, which uses a spatial tri-coordinated attention (STCA) module and a temporal-reduced self-attention (TRSA) module to select salient features and thereby achieve strong multi-object tracking and segmentation performance.

Method The network consists of a 2D encoder and a 3D decoder. Multiple consecutive frames are first fed into the 2D encoder to extract image features at different resolutions. Starting from the lowest-resolution features, the STCA module extracts important spatial features and the TRSA module extracts temporal features carrying key-frame information; both are fused with the original features. The fused features are then fed, together with the next higher-resolution features, into 3D convolutional layers, and features of different levels are aggregated repeatedly, yielding multiply fused features that carry both key temporal information and important spatial information, from which the tracking and segmentation results are obtained.

Results The method is evaluated quantitatively on the YouTube-VIS (YouTube video instance segmentation) and KITTI MOTS (multi-object tracking and segmentation) datasets. On YouTube-VIS, the proposed method improves AP (average precision) by 0.2% over the second-best model, CompFeat. On KITTI MOTS, compared with the second-best model, STEm-Seg, the proposed method reduces the ID switch count by 9 on the car class; on the pedestrian class it improves sMOTSA (soft multi-object tracking and segmentation accuracy), MOTSA (multi-object tracking and segmentation accuracy), and MOTSP (multi-object tracking and segmentation precision) by 0.7%, 0.6%, and 0.9%, respectively, and reduces the ID switch count by 1. Ablation experiments on KITTI MOTS verify the effectiveness of the STCA and TRSA modules and show that the proposed algorithm improves multi-object tracking and segmentation.

Conclusion The proposed multi-object tracking and segmentation model fully exploits the feature information across multiple frames, making the multi-object tracking and segmentation results more accurate.
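Below is a minimal PyTorch sketch of one decoder stage of the kind described above, in which attention-refined low-resolution features are fused with the next higher-resolution encoder features through 1×1×1 and 3×3×3 3D convolutions. It is an illustration under stated assumptions: the spatial and temporal attention modules (sketched later on this page) are assumed to return tensors of the same shape as their input, the fusion with the original features is assumed to be element-wise addition, and the class name DecoderStage and all parameter values are hypothetical rather than taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One 3D decoder stage (illustrative): refine low-resolution features with
    spatial/temporal attention, then fuse them with higher-resolution features."""
    def __init__(self, low_ch, high_ch, out_ch, spatial_attn, temporal_attn):
        super().__init__()
        self.spatial_attn = spatial_attn      # e.g., an STCA module (sketched later)
        self.temporal_attn = temporal_attn    # e.g., a TRSA module (sketched later)
        self.reduce = nn.Conv3d(low_ch + high_ch, out_ch, kernel_size=1)  # 1x1x1 fusion
        self.fuse = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        # low:  (B, C_low,  T, H/2, W/2)  attention-branch input
        # high: (B, C_high, T, H,   W)    skip feature from the 2D encoder
        refined = low + self.spatial_attn(low) + self.temporal_attn(low)  # assumed additive fusion
        refined = F.interpolate(refined, size=high.shape[2:], mode="trilinear",
                                align_corners=False)
        x = self.reduce(torch.cat([refined, high], dim=1))
        return self.fuse(x)

# Example usage with identity placeholders standing in for STCA / TRSA:
# stage = DecoderStage(256, 128, 128, nn.Identity(), nn.Identity())
# out = stage(torch.randn(1, 256, 8, 32, 64), torch.randn(1, 128, 8, 64, 128))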
Keywords
Spatiotemporal feature fusion network based multi-object tracking and segmentation

Liu Yuting, Zhang Kaihua, Fan Jiaqing, Liu Qingshan (Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China)

Abstract
Objective Multi-object tracking and segmentation aims to track and segment multiple objects in a video and involves detection, tracking, and segmentation. Most existing methods follow the detect-then-track-and-segment paradigm inherited from multi-object tracking. However, they pay insufficient attention to important feature information and therefore struggle with target occlusion and cluttered contexts. We propose a joint multi-object tracking and segmentation method built on a 3D spatiotemporal feature fusion network (STFNet) with two attention modules, spatial tri-coordinated attention (STCA) and temporal-reduced self-attention (TRSA), which adaptively select salient feature representations to improve tracking and segmentation performance.

Method STFNet consists of a 2D encoder and a 3D decoder. First, multiple consecutive frames are fed into the 2D encoder, and the decoder takes the lowest-resolution features as input. These low-resolution features are fused by three 3D convolutional layers; the STCA module then extracts spatial features carrying key spatial information, and the TRSA module integrates temporal features carrying key-frame information. Both are merged with the original features. Next, the higher-resolution encoder features and the low-level fused features are fed together into a 1×1×1 3D convolutional layer, and this aggregation is repeated across levels, so that the features carrying key-frame information and salient spatial information are fused several times. Finally, STFNet fits the resulting features to a three-dimensional Gaussian distribution for each instance; each Gaussian assigns pixels across consecutive frames to a specific object or to the background, which yields the segmentation of every target.

Specifically, STCA is an attention-enhanced version of coordinate attention. Coordinate attention computes only horizontal and vertical attention weights, so it captures positional information along the two spatial directions but ignores the channel dimension. STCA adds a channel-direction attention mechanism to further retain useful information and discard useless information. First, STCA extracts horizontal, vertical, and channel-direction features via average pooling, encoding information from the three coordinate directions for the subsequent computation of attention weights. Second, STCA fuses the three descriptors two by two, concatenates the results, and passes them through a 1×1 convolution for feature fusion, followed by batch normalization and a non-linear activation function, so that the features of the three coordinate directions are fused. Third, separate attention features are derived from the fused features for each coordinate direction; attention features belonging to the same direction are added together, and the weight of each direction is obtained through a sigmoid function. Finally, the weights are multiplied by the original feature to produce the output of STCA. A code sketch of one possible implementation is given right after this paragraph.
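The following is a minimal PyTorch sketch of the STCA module under one plausible reading of the description above. As a simplification it concatenates the three directional descriptors directly instead of first fusing them two by two, it operates per frame on (B, C, H, W) features, and all hyper-parameter values (e.g., the channel reduction ratio) are assumptions rather than values reported in the paper.

import torch
import torch.nn as nn

class STCA(nn.Module):
    """Spatial tri-coordinated attention (illustrative sketch): average-pool along the
    horizontal, vertical, and channel directions, fuse the descriptors with a shared
    1x1 conv + BN + non-linearity, then re-weight the input with per-direction sigmoids."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_c = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # (B, C, H, 1) pooled along width
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1) pooled along height
        x_c = x.mean(dim=(2, 3), keepdim=True)                  # (B, C, 1, 1) pooled over the plane
        y = self.fuse(torch.cat([x_h, x_w, x_c], dim=2))        # joint 1x1 conv + BN + ReLU fusion
        y_h, y_w, y_c = torch.split(y, [h, w, 1], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                       # (B, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w)).permute(0, 1, 3, 2)   # (B, C, 1, W)
        a_c = torch.sigmoid(self.attn_c(y_c))                       # (B, C, 1, 1)
        return x * a_h * a_w * a_c                               # re-weight the original feature

# m = STCA(256); y = m(torch.randn(1, 256, 34, 72))  # output has the same shape as the input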
For the TRSA module, temporal features are selected to cope with the frequent occlusions in multi-object tracking and segmentation: the module makes the network attend to the object information of key frames and weakens the information of occluded frames. 1) TRSA feeds the features into three 1×1×1 3D convolutions whose purpose is to reduce the channel dimension. 2) TRSA flattens the remaining dimensions to obtain three matrices in which only the temporal axis is kept separate, and a one-dimensional convolution further reduces the dimensionality, greatly lowering the cost of the subsequent matrix operations. 3) TRSA transposes one of the matrices and multiplies the non-transposed matrix by the transposed one to obtain a low-dimensional matrix; the result is passed through a softmax function to obtain the temporal attention weights. 4) The attention weights are multiplied by the original features, and the features are restored to the original dimension by expanding and rearranging dimensions and passing through a 3D convolution. A code sketch of one possible implementation is given after the abstract.

Result Our main evaluations are conducted on the YouTube video instance segmentation (YouTube-VIS) and KITTI multi-object tracking and segmentation (KITTI MOTS) datasets. For YouTube-VIS, we train jointly on the YouTube-VIS and common objects in context (COCO) training sets, where the overlapping categories between the two datasets amount to just 20 object classes; the input image size is 640×1 152 pixels, and the evaluation metrics used in MaskTrack R-CNN (region convolutional neural network), average precision (AP) and average recall (AR), are adopted to evaluate tracking and segmentation performance. For KITTI MOTS, the model is trained on the KITTI MOTS training set with an input image size of 544×1 792 pixels, and the metrics used in TrackR-CNN, soft multi-object tracking and segmentation accuracy (sMOTSA), multi-object tracking and segmentation accuracy (MOTSA), multi-object tracking and segmentation precision (MOTSP), and ID switch (IDS), are adopted. The training data are augmented by random horizontal flipping, reversing the frame order of videos, and image brightness enhancement. ResNet-101 is used as the backbone network and is initialized with the weights of a Mask R-CNN model pre-trained on the COCO training set; the decoder weights are randomly initialized. Three loss functions are used for training: the Lovász hinge loss for learning the feature embedding vectors, a smoothness loss for learning the variance values, and the L2 loss for generating the instance center heat maps. On YouTube-VIS, the AP value is 0.2% higher than that of the second-best method, CompFeat. On KITTI MOTS, compared with the second-best method, STEm-Seg, the ID switch count is reduced by 9 on the car category; on the pedestrian category, sMOTSA, MOTSA, and MOTSP are increased by 0.7%, 0.6%, and 0.9%, respectively, and the ID switch count is reduced by 1. Ablation experiments on KITTI MOTS show that STCA improves performance by 0.5% over the baseline and TRSA improves it by 0.3%, confirming that both modules are effective.

Conclusion We present a multi-object tracking and segmentation model based on spatiotemporal feature fusion. The model fully mines the feature information shared across the frames of a video, making the tracking and segmentation results more accurate. The experimental results also show that STFNet alleviates the target occlusion problem to a certain extent.
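As referenced in the Method description, the following is a minimal PyTorch sketch of the TRSA module under one plausible reading of steps 1)-4): 1×1×1 3D convolutions reduce the channel dimension, the per-frame descriptors are flattened and further compressed by a 1D convolution, a T×T temporal attention map is computed with a softmax, and the re-weighted frames are restored with a 3D convolution. The reduced token dimension, the scaling factor, the residual connection, and the use of nn.LazyConv1d are illustrative assumptions rather than details confirmed by the paper.

import torch
import torch.nn as nn

class TRSA(nn.Module):
    """Temporal-reduced self-attention (illustrative sketch) over a (B, C, T, H, W) volume."""
    def __init__(self, channels, reduced_channels=None, token_dim=64):
        super().__init__()
        reduced_channels = reduced_channels or max(1, channels // 4)
        # Step 1: 1x1x1 3D convolutions that reduce the channel dimension.
        self.to_q = nn.Conv3d(channels, reduced_channels, kernel_size=1)
        self.to_k = nn.Conv3d(channels, reduced_channels, kernel_size=1)
        self.to_v = nn.Conv3d(channels, reduced_channels, kernel_size=1)
        # Step 2: 1D convolutions that shrink each flattened per-frame descriptor,
        # keeping the later T x T matrix product cheap.
        self.reduce_q = nn.LazyConv1d(token_dim, kernel_size=1)
        self.reduce_k = nn.LazyConv1d(token_dim, kernel_size=1)
        self.restore = nn.Conv3d(reduced_channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.to_q(x).permute(0, 2, 1, 3, 4).reshape(b, t, -1)   # (B, T, C'*H*W)
        k = self.to_k(x).permute(0, 2, 1, 3, 4).reshape(b, t, -1)
        v = self.to_v(x).permute(0, 2, 1, 3, 4).reshape(b, t, -1)
        q = self.reduce_q(q.transpose(1, 2)).transpose(1, 2)        # (B, T, token_dim)
        k = self.reduce_k(k.transpose(1, 2)).transpose(1, 2)        # (B, T, token_dim)
        # Step 3: frame-to-frame attention weights via softmax.
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, T, T)
        # Step 4: re-weight the frames, then restore the layout and the channel count.
        out = (attn @ v).reshape(b, t, -1, h, w).permute(0, 2, 1, 3, 4)
        return x + self.restore(out)                                # residual connection assumed

# m = TRSA(256); y = m(torch.randn(1, 256, 8, 34, 72))  # output has the same shape as the input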
Keywords
