Class-aware network with global temporal relations for video action detection

Wang Dongqi, Zhao Xu (Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China)

Abstract
Objective Temporal action detection is an important problem in video understanding: the task aims to locate the start and end times of action instances in a video and to predict their action classes. The key components of action detection are recognizing action patterns and building temporal relations within the video. Current mainstream methods usually attempt to design a universal detection algorithm to localize actions of all classes, ignoring the large differences in action patterns across classes, which limits detection accuracy. Moreover, building temporal relations within the video is crucial for detection accuracy; graph convolution is commonly used for global temporal modeling, but it is computationally expensive. To address these limitations, this paper proposes a class-by-class action detection method and uses gated recurrent units (GRUs) to build global temporal relations within the video effectively at low computational cost. Method For action-pattern recognition, the video is first coarsely classified, and a multi-branch class-aware detection mechanism then detects each action class specifically: action boundaries are located by recognizing boundary patterns in local video features, and the probability that an anchor contains a complete action is evaluated by recognizing action patterns. For temporal modeling, a concise and effective temporal relation module is constructed, using GRUs to build global temporal relations between the current moment and both past and future moments. These designs are integrated into a class-aware video action detection method with global temporal relations. Result To validate the proposed method, experiments are conducted on two public datasets with several kinds of video features, and the method is compared with other state-of-the-art methods. On ActivityNet-1.3, the method achieves an average mAP (mean average precision) of 35.58% with two-stream features, outperforming existing methods; on THUMOS-14, the method achieves the best performance under multiple features. The results show that the class-aware detection strategy and the GRU-based temporal modeling effectively improve detection accuracy. In addition, the proposed temporal relation module has lower computational cost than other mainstream models that use graph convolution, and shows a degree of generalization ability. Conclusion A class-aware video action detection model with global temporal relations is proposed, which achieves finer per-class action detection; a temporal relation module based on GRUs is also designed, improving the accuracy of video action detection.
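To make the GRU-based temporal modeling concrete, the following is a minimal PyTorch sketch of the idea described above: one GRU run forward over the feature sequence summarizes the past, a second GRU run over the time-reversed sequence summarizes the future, and both are fused with the local feature at every time step. This is an illustrative sketch only; all module and parameter names are hypothetical, and the actual architecture in the paper may differ.

```python
import torch
import torch.nn as nn

class TemporalRelationModule(nn.Module):
    """Hypothetical sketch of a GRU-based global temporal relation module.

    A forward GRU aggregates past context; a GRU applied to the
    time-reversed sequence aggregates future context. Both are fused
    with the local feature, giving every time step a global receptive field.
    """

    def __init__(self, feat_dim=400, hidden_dim=256):
        super().__init__()
        self.fwd_gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.bwd_gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Conv1d(feat_dim + 2 * hidden_dim, feat_dim, kernel_size=1)

    def forward(self, x):
        # x: (batch, T, feat_dim) snippet-level video feature sequence
        past, _ = self.fwd_gru(x)                        # past-aware states
        future, _ = self.bwd_gru(torch.flip(x, dims=[1]))
        future = torch.flip(future, dims=[1])            # realign to original order
        feats = torch.cat([x, past, future], dim=-1)     # local + past + future
        return self.fuse(feats.transpose(1, 2)).transpose(1, 2)

# Example: 100 snippet features of dimension 400 (e.g., two-stream features)
x = torch.randn(2, 100, 400)
out = TemporalRelationModule()(x)  # (2, 100, 400), globally context-aware
```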
Keywords
Class-aware network with global temporal relations for video action detection

Wang Dongqi, Zhao Xu(Department of Automation, Shanghai Jiao Tong University,Shanghai 200240, China)

Abstract
Objective With the enormous volume of videos on the internet, video-based action understanding has drawn increasing attention. As a significant task in video understanding, temporal action detection (TAD) aims to locate the boundaries of each action instance in untrimmed videos and to classify its action label. Inspired by the success of object detection, a two-stage pipeline dominates the field of TAD: the first stage generates candidate action segments (proposals), which are then assigned class labels in the second stage. Overall, TAD performance largely depends on two aspects: recognizing action patterns and exploring temporal relations. 1) Current methods usually recognize start and end patterns to locate action boundaries, while the patterns between boundaries contribute to predicting the confidence score of each segment. 2) Temporal relations are vital for accurate detection because information in a video is closely related in time, and a broader receptive field helps the model understand the context and semantic relations of the whole video. However, existing methods have limitations in both aspects. In terms of pattern recognition, almost all methods force the model to cater to all kinds of actions (class-agnostic), which means a universal pattern has to be summarized to locate the start, end, and actionness of every action; this is challenging because action patterns vary dramatically across classes. As for temporal relations, graph convolutional networks have recently prevailed for modeling temporal relations in video, but they are computationally costly. Method We develop a class-aware network (CAN) with global temporal relations to tackle these two problems; CAN contains two crucial designs. 1) Different action classes should be treated differently, so that the model can recognize the patterns of each class unambiguously. A class-aware mechanism (CAM) is embedded into the detection pipeline. It includes several action branches and a universal branch: each action branch takes charge of one specific class, while the universal branch supplies complementary information for more accurate detection. After a video-level classifier produces a coarse, general action label for the raw video, the corresponding action branch in CAM is activated to generate predictions. 2) A gated recurrent unit (GRU)-assisted ternary basenet (TB) is designed to explore temporal relations more effectively. Since the whole video feature sequence is accessible in the offline TAD task, by reversing the input order of the features, the GRU can not only memorize past features but also gather future information. In TB, these temporal features are combined, so the receptive field of the model is not restricted locally but extended bidirectionally to the past and the future; thus global temporal relations over the whole video are built. Result Our experiments are carried out on two benchmarks: ActivityNet-1.3 and THUMOS-14. 1) THUMOS-14 consists of 200 temporally annotated videos in the validation set and 213 videos in the testing set, covering 20 action categories. 2) ActivityNet-1.3 contains 19 994 temporally annotated videos with 200 action classes; furthermore, the hierarchical structure of all classes is available in the annotations. Comparative analysis is conducted as well. 1) On THUMOS-14, CAN improves the average mean average precision (mAP) to 54.90%. 2) On ActivityNet-1.3, the average mAP of CAN reaches 35.58%, 1.73% higher than the 33.85% of its baseline. Additionally, ablation experiments demonstrate the effectiveness of our method: both the class-aware mechanism and TB contribute to detection accuracy, and TB builds global temporal relations effectively at lower computational cost than the graph model designed in sub-graph localization for temporal action detection (G-TAD). Conclusion Our research highlights two key aspects of the temporal action detection task: 1) recognizing action patterns and 2) exploring temporal relations. The class-aware mechanism (CAM) is designed to detect action segments of different classes rationally and accurately, and TB provides an effective way to explore temporal relations at the frame level. These two designs are integrated into one framework, the class-aware network (CAN) with global temporal relations, which achieves the best results on the two benchmarks.
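To illustrate the class-aware mechanism described in the Method section, here is a minimal PyTorch sketch: a video-level label selects one per-class detection branch, whose output is combined with a universal branch. The head design, dimensions, and all names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ClassAwareHeads(nn.Module):
    """Hypothetical sketch of the class-aware mechanism (CAM).

    One detection branch per coarse action class plus a universal branch;
    the coarse video-level label decides which action branch is activated.
    """

    def __init__(self, num_classes=20, feat_dim=400):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(256, 3, kernel_size=1),  # start / end / actionness per step
            )
        self.action_branches = nn.ModuleList([head() for _ in range(num_classes)])
        self.universal_branch = head()

    def forward(self, feats, video_label):
        # feats: (batch, feat_dim, T); video_label: coarse class index
        specific = self.action_branches[video_label](feats)  # class-specific patterns
        universal = self.universal_branch(feats)             # complementary cues
        return specific + universal                          # fused predictions

# Example: the video-level classifier predicted coarse class 7
feats = torch.randn(2, 400, 100)
preds = ClassAwareHeads()(feats, video_label=7)  # (2, 3, 100) boundary/actionness scores
```

Because only one action branch runs per video, the per-video cost stays close to that of a single class-agnostic head while each branch can specialize in one class's patterns.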
Keywords
