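Video object segmentation via feature attention pyramid modulating network
Abstract
Objective Video object segmentation segments the object of interest throughout a video sequence, given the annotated object mask of the first frame. However, because the objects to be segmented vary widely in scale, existing video object segmentation algorithms lack an effective strategy for fusing feature information at different scales. This paper therefore proposes a feature attention pyramid modulating network module for video object segmentation. Method A visual modulator network and a spatial modulator network first learn the visual and spatial information of the object to be segmented, and this information serves as a prior that guides the segmentation model to adapt to the appearance of the specific object. A feature attention pyramid module then mines global context information to handle the multiple scales of the segmented objects. Result Experiments show that, on the DAVIS 2016 dataset, the proposed method without online fine-tuning achieves results competitive with state-of-the-art methods that use online fine-tuning, reaching a J-mean of 78.7%. With online fine-tuning, the proposed method achieves the best results on the DAVIS 2017 dataset, reaching a J-mean of 68.8%. Conclusion While segmenting the object of interest, the proposed algorithm effectively combines context information for object masks of different scales, reduces the loss of detail, and achieves high-quality video object segmentation.
Keywords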
Video object segmentation via feature attention pyramid modulating network
Tang Runfa, Song Huihui, Zhang Kaihua, Jiang Sihao (Jiangsu Key Laboratory of Big Data Analysis Technology, Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China)
Abstract
Objective Video object segmentation aims to separate a target object from the background and other instances at the pixel level. Segmenting objects in videos is a fundamental task in computer vision with wide applications, such as video surveillance, video editing, and autonomous driving. The task is challenging because of occlusion, fast motion, motion blur, and significant appearance variation over time. In this paper, we leverage modulators to learn the limited visual and spatial information of a given target object and thereby adapt a general segmentation network to the appearance of a specific object instance. Because the objects to be segmented vary in scale, existing video object segmentation algorithms lack appropriate strategies for using feature information at different scales. Therefore, we design a feature attention pyramid module for video object segmentation.
Method To adapt the generic segmentation network to the appearance of a specific object instance in a single feed-forward pass, we employ two modulators, a visual modulator and a spatial modulator, which learn to adjust the intermediate layers of the generic segmentation network given an arbitrary target object instance. The modulators extract information from the image of the annotated object and from the spatial prior of the object to produce a set of parameters, which are injected into the segmentation model for layer-wise feature manipulation. The visual modulator network is a convolutional neural network (CNN) that takes the image of the annotated object as input and produces a vector of scale parameters for all modulation layers. It adapts the segmentation network to focus on a specific object instance, namely the annotated object in the first frame. The visual modulator implicitly learns an embedding of different types of objects.
It should produce similar parameters for similar objects and different parameters for different objects when adjusting the segmentation network. The spatial modulator network is an efficient network that produces bias parameters from the spatial prior input. Given that objects move continuously in a video, we set the prior to the predicted location of the object mask in the previous frame. Specifically, we encode the location information as a heatmap with a 2D Gaussian distribution on the image plane. The center and standard deviations of the Gaussian distribution are computed from the predicted mask of the previous frame. The spatial modulator downsamples the heatmap to different scales to match the resolutions of the feature maps in the segmentation network and then applies a scale-and-shift operation to each downsampled heatmap to generate the bias parameters of the corresponding modulation layer. The scale problem of the segmentation network can be addressed by multi-scale pooling of the feature map. Fusing features of different scales combines context information from different receptive fields and merges the overall contour with texture details; thus, both large-scale and small-scale objects can be segmented with effective use of context information, reducing the loss of detail as much as possible and achieving high-quality pixel-level video object segmentation. PSPNet and DeepLab address this problem by performing spatial pyramid pooling at different grid scales or dilation rates (the latter is called atrous spatial pyramid pooling (ASPP)). In the ASPP module, dilated convolution is a sparse calculation that may cause grid artifacts. The pyramid pooling module proposed in PSPNet, in turn, may lose pixel-level localization information. Structures of this kind lack a global context prior that selects features in a channel-wise manner through attention, as in SENet and EncNet.
On the other hand, a channel-wise attention vector alone is not enough to extract multi-scale features effectively, because it lacks pixel-wise information. Inspired by SENet and ParseNet, we attempt to extract precise pixel-level attention for the high-level features extracted by CNNs. Our proposed feature attention pyramid (FAP) module enlarges the receptive field and handles both small and large objects effectively, thus addressing the problem of multi-scale segmentation. Specifically, the FAP module combines the attention mechanism with a spatial pyramid and fuses context information from different receptive fields by combining features of different scales together with the global context prior. We use 30×30, 15×15, 10×10, and 5×5 pooling in the pyramid structure to better extract context at different pyramid scales. The pyramid structure then concatenates the information of the different scales, which incorporates context features precisely. Furthermore, the original features from the CNN, after passing through a 1×1 convolution, are multiplied in a pixel-wise manner by the pyramid attention features. We also introduce a global pooling branch that is concatenated with the output features. The resulting feature map provides improved channel-wise attention for learning good feature representations, so that context information is used effectively when segmenting both large- and small-scale objects. Benefiting from the spatial pyramid structure, the FAP module fuses context information at different scales and at the same time produces improved pixel-level attention for high-level feature maps.
Result We validate the effectiveness and robustness of the proposed method on the challenging DAVIS 2016 and DAVIS 2017 datasets. Without online fine-tuning, the proposed method reaches a J-mean of 78.7% on DAVIS 2016, which is competitive with state-of-the-art methods that use online fine-tuning. With online fine-tuning, it reaches a J-mean of 68.8% on DAVIS 2017 and outperforms these methods.
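The feature attention pyramid described in the Method paragraph can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the learned 1×1 convolutions are replaced by fixed stand-ins (a uniform average for the attention branch, the identity for the original features), and the sigmoid that squashes the attention map is an assumption, since the abstract does not state the activation. Feature maps are nested lists of shape channels × height × width.

```python
import math

def adaptive_avg_pool(chan, g):
    """Average-pool an H x W map onto a g x g grid of (possibly overlapping) bins."""
    h, w = len(chan), len(chan[0])
    out = []
    for i in range(g):
        y0, y1 = (i * h) // g, ((i + 1) * h + g - 1) // g  # floor start, ceil end
        row = []
        for j in range(g):
            x0, x1 = (j * w) // g, ((j + 1) * w + g - 1) // g
            cells = [chan[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out

def upsample_nearest(grid, h, w):
    """Resize a g x g grid back to H x W with nearest-neighbour lookup."""
    g = len(grid)
    return [[grid[(y * g) // h][(x * g) // w] for x in range(w)] for y in range(h)]

def fap(features, pool_sizes=(30, 15, 10, 5)):
    """Sketch of a feature attention pyramid over a C x H x W feature list."""
    h, w = len(features[0]), len(features[0][0])
    # Pyramid branch: pool every channel at every scale, resize, concatenate.
    pyramid = [upsample_nearest(adaptive_avg_pool(c, g), h, w)
               for g in pool_sizes for c in features]
    # Stand-in for a learned 1x1 convolution: uniform average over the
    # concatenated channels, followed by a sigmoid (assumption).
    attention = [[1.0 / (1.0 + math.exp(-sum(p[y][x] for p in pyramid) / len(pyramid)))
                  for x in range(w)] for y in range(h)]
    # Original features (their 1x1 convolution is taken as the identity here)
    # multiplied pixel-wise by the attention map.
    attended = [[[c[y][x] * attention[y][x] for x in range(w)] for y in range(h)]
                for c in features]
    # Global pooling branch: per-channel mean broadcast to H x W.
    global_branch = []
    for c in features:
        mean = sum(map(sum, c)) / (h * w)
        global_branch.append([[mean] * w for _ in range(h)])
    # Concatenate the global branch with the attended output along channels.
    return attended + global_branch
```

For an input with C channels the sketch returns 2C channels: the C attended maps followed by the C global-context maps. In a trained network the stand-ins would be replaced by learned convolutions, and the pooled maps would typically be resized with bilinear rather than nearest-neighbour interpolation.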
Conclusion In this study, we use two modulator networks to learn the visual and spatial information of the object to be segmented. The visual modulator produces channel-wise scale parameters that adjust the weights of different channels in the feature maps, while the spatial modulator generates element-wise bias parameters that inject the spatial prior into the modulated features. We use the modulators as prior guidance that enables the segmentation model to adapt to the appearance of specific objects. Beyond segmenting the object of interest, the proposed method effectively combines context information for object masks of different scales, which reduces the loss of detail and achieves high-quality pixel-level video object segmentation.
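The modulation summarized above (a channel-wise scale from the visual modulator plus an element-wise bias derived from a Gaussian heatmap of the previous mask) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the channel scales are passed in directly rather than predicted by a CNN, the scale-and-shift coefficients `a` and `b` are fixed numbers instead of learned values, the heatmap is assumed to already match the feature resolution, and the floor of 1.0 on the standard deviations is added only to avoid division by zero.

```python
import math

def gaussian_heatmap(mask):
    """Encode a binary mask (list of rows) as a 2D Gaussian heatmap.

    The centre and standard deviations are the mean and spread of the
    foreground pixel coordinates of the mask.
    """
    pts = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    n = len(pts)
    cy = sum(y for y, _ in pts) / n
    cx = sum(x for _, x in pts) / n
    # Floor the deviations so a very thin mask does not divide by zero (assumption).
    sy = max(math.sqrt(sum((y - cy) ** 2 for y, _ in pts) / n), 1.0)
    sx = max(math.sqrt(sum((x - cx) ** 2 for _, x in pts) / n), 1.0)
    return [[math.exp(-0.5 * (((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2))
             for x in range(len(mask[0]))] for y in range(len(mask))]

def modulate(features, channel_scales, heatmap, a=1.0, b=0.0):
    """Apply scale * feature + bias to a C x H x W feature list.

    channel_scales: one scale per channel (visual modulator output).
    heatmap: H x W spatial prior; the bias is a * heatmap + b, where a and b
    are hypothetical stand-ins for the learned scale-and-shift parameters.
    """
    bias = [[a * v + b for v in row] for row in heatmap]
    return [[[s * f + bv for f, bv in zip(frow, brow)]
             for frow, brow in zip(chan, bias)]
            for s, chan in zip(channel_scales, features)]

# Usage: a 3 x 3 foreground block centred in a 5 x 5 mask.
mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
heat = gaussian_heatmap(mask)
feats = [[[1.0] * 5 for _ in range(5)], [[2.0] * 5 for _ in range(5)]]
modulated = modulate(feats, [2.0, 0.5], heat)
```

The bias is largest at the centre of the previous mask and decays away from it, so features near the expected object location are boosted in every channel, while the per-channel scales re-weight channels according to the object's appearance.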
Keywords