Object tracking using enhanced second-order network modulation

Wang Xianhai1,2, Song Huihui1,2, Zhang Kaihua1,2, Liu Qingshan1,2 (1. Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China; 2. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)

Abstract
Objective The appearance model plays a decisive role in the performance of visual object tracking. Tracking algorithms based on network modulation build an efficient subnetwork to learn the appearance information of the target in the reference frame, which is then used for robust matching of the target in the test frame; such algorithms perform excellently on multiple object tracking datasets. However, these algorithms ignore the important role of high-order information in robustly modeling object appearance, so tracking drift easily occurs when the object appearance undergoes large-scale changes. To address this, this paper proposes a second-order pooling modulation subnetwork enhanced with global context information, which learns high-order features to improve tracker performance. Method First, convolutional neural networks (CNN) are used to extract features from the reference and test frames. Then, long short-term memory networks (LSTM) applied along different directions capture the global context information of each pixel, and a second-order pooling network extracts high-order information. Finally, a modulation mechanism guides the test frame to learn optimal intersection-over-union prediction. Meanwhile, to improve the stability of the tracker, the object appearance features are adaptively updated during online tracking via an exponentially weighted average. Result Experimental results show that on the OTB100 (object tracking benchmark) dataset, the success rate of the proposed method is 67.9%, surpassing the ATOM (accurate tracking by overlap maximization) tracker by 1.5%; on the VOT2018 (visual object tracking) dataset, the expected average overlap (EAO) is 0.44, surpassing ATOM by 4%. Conclusion By building a second-order pooling modulation subnetwork enhanced with global context information to learn an efficient appearance model, the proposed tracker achieves state-of-the-art performance.
Keywords
Object tracking using enhanced second-order network modulation

Wang Xianhai1,2, Song Huihui1,2, Zhang Kaihua1,2, Liu Qingshan1,2(1.Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China;2.Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)

Abstract
Objective An appearance model plays a key role in the performance of visual object tracking. In recent years, tracking algorithms based on network modulation learn an appearance model by building an effective subnetwork, and thus, they can more robustly match the target in the search frames. These algorithms exhibit excellent performance on many object tracking benchmarks. However, these tracking methods disregard the importance of high-order feature information, causing drift when large-scale changes in target appearance occur. This study utilizes a global contextual attention-enhanced second-order network to model target appearance. This network helps enhance the nonlinear modeling capability in visual tracking. Method The tracker includes two components: a target estimation component and a classification component. It can be regarded as a two-stage tracker; compared with methods based on Siamese networks, its speed is relatively slow. The target estimation component is trained offline to predict the overlap between the target and the estimated bounding boxes. This tracker presents an effective network architecture for visual tracking. This architecture includes two novel module designs. The first is pixel-wise global contextual attention (pGCA), which leverages a bidirectional long short-term memory network (Bi-LSTM) to sweep row-wise and column-wise across feature maps and fully capture the global context information of each pixel. The other is second-order pooling modulation (SPM), which uses the feature covariance matrix of the template frame to learn a second-order modulation vector. The modulation vector then channel-wise multiplies the intermediate feature maps of the query image to transfer target-specific information from the template frame to the query frame. In addition, this study selects the widely adopted ResNet-50 as the backbone network. This network is pretrained on the ImageNet classification task.
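The SPM idea described above can be illustrated with a minimal NumPy sketch, not the paper's implementation: the channel covariance matrix of the template features is reduced to a per-channel modulation vector that channel-wise multiplies the query features. The reduction from covariance to vector and the squashing function here are illustrative assumptions, since this abstract does not specify the exact head.

```python
import numpy as np

def second_order_modulation(feat, eps=1e-5):
    """Sketch of second-order pooling: compute the channel covariance
    matrix of a C x H x W feature map, then reduce it to a length-C
    modulation vector (row-averaging and a sigmoid squash are
    illustrative choices, not taken from the paper)."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # flatten spatial dims
    x = x - x.mean(axis=1, keepdims=True)   # center each channel
    cov = x @ x.T / (h * w - 1)             # C x C covariance matrix
    vec = cov.mean(axis=1)                  # collapse to a C-vector
    return 1.0 / (1.0 + np.exp(-vec / (np.abs(vec).max() + eps)))

# Channel-wise modulation of query features by the template vector:
template = np.random.rand(8, 14, 14).astype(np.float32)
query = np.random.rand(8, 14, 14).astype(np.float32)
m = second_order_modulation(template)       # shape (8,)
modulated = query * m[:, None, None]        # broadcast over H x W
```

Because the covariance matrix captures pairwise channel correlations, the resulting vector carries second-order statistics of the template appearance rather than a plain first-order average.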
Given the input template image X0 with bounding box b0 and query image X, this study selects the feature maps of the third and fourth layers for subsequent processing. The feature maps are fed into the pGCA module and the precise region-of-interest pooling (PrPool) module, which are used to obtain the features of the annotated area. The maps are then concatenated to yield multi-scale features enhanced by global context information. Moreover, to handle the misaligned features caused by large-scale deformation between the query and template images, the tracker injects two deformable convolution blocks into the bottom branch for feature alignment. Then, the fused feature is passed through the two branches of SPM, generating two modulation vectors that channel-wise multiply the corresponding feature layers on the bottom branch of the search frame. The fused feature contributes more to the performance of the tracker through network modulation than through the correlation used in Siamese networks. Thereafter, the modulated features are fed into two PrPool layers and then concatenated. The output features are finally fed into the intersection-over-union predictor module, which is composed of three fully connected layers. Given the annotated ground truth, the tracker minimizes the estimation error to train all the network parameters in an end-to-end manner. The classification component is a two-layer fully convolutional neural network. In contrast with the estimation component, it is trained online to predict a target confidence score. Thus, this component can provide a rough 2D location of the object. During online learning, the objective function is optimized using the conjugate gradient method instead of stochastic gradient descent for real-time tracking. For robustness, this study uses an averaging strategy to update the object appearance in this component. This strategy has been widely used in discriminative correlation filters.
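The averaging update just mentioned can be sketched as an exponentially weighted moving average, as commonly used in discriminative correlation filters. The learning rate below is an illustrative value, not one taken from the paper.

```python
import numpy as np

def ema_update(stored_feat, new_feat, lr=0.01):
    """Exponentially weighted moving-average update of the stored
    appearance feature: keep (1 - lr) of the old model and blend in
    lr of the newly observed feature. lr = 0.01 is illustrative."""
    return (1.0 - lr) * stored_feat + lr * new_feat

stored = np.ones((8, 14, 14))       # appearance model kept online
incoming = np.zeros((8, 14, 14))    # feature from the current frame
updated = ema_update(stored, incoming, lr=0.1)
```

Because each update only moves the model a small step toward the latest observation, the appearance changes smoothly while still retaining information from all previous frames.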
For this strategy, this study assumes that the appearance of the object changes smoothly and consistently over successive frames. Simultaneously, the strategy can fully utilize the information of previous frames. The overall tracking process involves using the classification component to obtain a rough location of the target, which is a response map with dimensions of 14×14×1. The tracker can distinguish the foreground from the background in accordance with the response map. Gaussian sampling is then used to generate a set of predicted target bounding boxes. Because the estimation component is trained offline, the predicted bounding boxes are fed to it, and the box with the highest predicted score is selected as the tracking result. Result The tracker validates the effectiveness and robustness of the proposed method on the OTB100 (object tracking benchmark) and the challenging VOT2018 (visual object tracking) datasets. The proposed method achieves the best performance in terms of success and precision plots, with an area under the curve (AUC) score of 67.9% and a precision score of 87.9%, outperforming the state-of-the-art ATOM (accurate tracking by overlap maximization) by 1.5% in AUC. Simultaneously, the expected average overlap (EAO) score of our method ranks first at 0.441 1, significantly outperforming the second-best method, ATOM, with an EAO score of 0.401 1, by 4%. Conclusion This study proposes a visual tracker based on network modulation. The tracker includes the pGCA and SPM modules. The pGCA module leverages Bi-LSTM to capture the global context information of each pixel. The SPM module uses the feature covariance matrix of the template frame to learn a second-order modulation vector to model target appearance. This design reduces the information loss of the first frame and enhances the correlation between features.
The tracker utilizes an averaging strategy to update the object appearance in the classification component for robustness. Therefore, the proposed tracker significantly outperforms state-of-the-art methods in terms of accuracy and efficiency.
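The candidate-selection step of the tracking process described above can be sketched as follows: Gaussian-perturbed boxes are sampled around the rough location from the classifier, each is scored, and the highest-scoring box wins. Here a plain geometric IoU against the rough box stands in for the offline-trained IoU prediction network, which is an assumption made only to keep the sketch self-contained.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

rng = np.random.default_rng(0)
rough = np.array([50.0, 50.0, 40.0, 40.0])        # rough box from classifier
# Gaussian sampling of candidate boxes around the rough location
candidates = rough + rng.normal(0.0, 5.0, size=(10, 4))
# Stand-in scorer; the real tracker scores candidates with the
# offline-trained intersection-over-union predictor instead.
scores = [iou(c, rough) for c in candidates]
best = candidates[int(np.argmax(scores))]          # highest score wins
```

In the actual tracker the selected box would then also be used to refresh the appearance model via the averaging update.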
Keywords
