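Temporal action detection with temporal proposal optimization
Abstract
Objective Temporal action detection, an active topic in computer vision, aims to detect the temporal intervals in which actions occur in a video and to determine the category of each action. The task has far-reaching practical significance. Locating actions quickly in long videos and performing temporal action detection on them remain challenging. To this end, this paper focuses on locating and optimizing the set of temporal candidates in which actions occur, and proposes TPO (temporal proposal optimization), a temporal action detection method based on optimizing temporal proposals. Method A convolutional neural network (CNN) and a bidirectional long short-term memory network (BLSTM) are used to capture the local temporal correlations and the global temporal information of a video. Connectionist temporal classification (CTC) optimization is introduced to evaluate the boundary probability and the actionness probability score at each temporal position. Finally, the two probability score curves are fused to refine and rank the temporal proposals, which yields action detection along the temporal dimension. Result In experiments on the ActivityNet v1.3 dataset, TPO reaches 74.66, 66.32, and 30.5 on the average recall at a given number of temporal proposals AR@100 (average recall@100), the area under the curve AUC, and the average mean average precision mAP, respectively. The mean average precision at individual thresholds, mAP@IoU (mAP@intersection over union), reaches 30.73 and 8.22 at thresholds of 0.75 and 0.95, respectively. Compared with SSN (structured segment network), TCN (temporal context network), Prop-SSAD (single shot action detector for proposal), CTAP (complementary temporal action proposal), and BSN (boundary sensitive network), TPO improves on all performance metrics. Conclusion The proposed model accounts for both the global and the local temporal information of a video, which makes the predicted boundaries of action proposals more accurate and flexible. The results also verify that accurate proposals effectively improve the precision of temporal action detection.
Keywords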
Temporal proposal optimization for temporal action detection
Xiong Chengxin, Guo Dan, Liu Xueliang (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)
Abstract
Objective With the ubiquity of electronic devices such as cellphones and cameras, massive amounts of video data capturing people's daily activities and behaviors are recorded, stored, and transmitted. A growing number of video-based applications, such as video surveillance, have attracted the attention of researchers. However, real-world videos are usually long and untrimmed. Long untrimmed videos in publicly available datasets for temporal action detection typically contain several ambiguous frames and a large number of background frames, so accurately locating action proposals and recognizing action labels is difficult. Similar to object detection, which builds on object proposal generation, temporal action detection can be divided into two phases: the first determines the specific duration (starting and ending timestamps) of each action, and the second identifies the category of each action instance. Single-action classification in trimmed videos has been highly successful, whereas the performance of temporal action proposal generation remains unsatisfactory. Candidate action proposal generation also involves time-consuming model training. High-quality proposals contribute to the performance of action detection. Research on temporal proposal generation helps to locate video content effectively and efficiently and facilitates the understanding of untrimmed videos. In this work, we focus on the optimization of temporal action proposals for action detection. Method We aim to improve the performance of action detection by optimizing temporal action proposals, that is, by accurately localizing the boundaries of actions in long untrimmed videos. We propose a temporal proposal optimization (TPO) model for the detection of candidate action proposals. 
TPO combines the advantages of convolutional neural networks (CNNs) and bidirectional long short-term memory (BLSTM) to capture local and global temporal cues simultaneously. In the proposed TPO model, we introduce connectionist temporal classification (CTC) optimization, which excels at parsing global feature-level classification labels. The global actionness probability calculated by BLSTM and CTC corrects several inexact temporal cues in the local CNN actionness probability. Thus, a probability fusion strategy based on local and global actionness probabilities improves the accuracy of the temporal boundaries of actions in videos and leads to promising temporal action detection performance. Specifically, TPO is composed of three modules: the local actionness evaluation module (LAEM), the global actionness evaluation module (GAEM), and the post-processing module (PPM). The extracted features are fed into LAEM and GAEM, which generate the local and global actionness probabilities along the temporal dimension, respectively. LAEM is a temporal CNN-based module, and GAEM predicts the global actionness probabilities with the help of BLSTM and the CTC loss. LAEM outputs three sequences: starting and ending probabilities in addition to local actionness probabilities. The crossing of the starting and ending probability curves builds the candidate temporal proposals. GAEM captures global actionness probabilities, which are auxiliary to those of LAEM. Then, the local and global actionness probabilities are fed into PPM to obtain a fused actionness probability curve. Subsequently, we sample the actionness probability curves through linear interpolation to extract proposal-level features. The proposal-level features are fed into a multilayer perceptron to obtain the confidence score. We use the confidence score to rank the candidate proposals and adopt soft-NMS (non-maximum suppression) to remove redundant proposals. 
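The two post-processing steps named above, sampling a probability curve by linear interpolation and suppressing redundant proposals with soft-NMS, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the number of sampled points, the Gaussian decay form, and the values of `sigma` and `top_k` are all assumptions made for illustration, and the multilayer perceptron that produces the confidence scores is replaced by scores supplied directly.

```python
import numpy as np


def sample_curve(curve, start, end, num_points=8):
    """Sample `curve` at evenly spaced positions in [start, end].

    Values between integer temporal positions are obtained by linear
    interpolation. `num_points` is an assumed setting.
    """
    positions = np.linspace(start, end, num_points)
    return np.interp(positions, np.arange(len(curve)), curve)


def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, scores, sigma=0.5, top_k=100):
    """Gaussian soft-NMS over temporal proposals (assumed variant).

    Instead of discarding proposals that overlap the current best one,
    their scores are decayed by exp(-tIoU^2 / sigma).
    Returns a list of (segment, score) pairs in selection order.
    """
    remaining = list(zip(proposals, scores))
    kept = []
    while remaining and len(kept) < top_k:
        # Pick the highest-scoring proposal that is still available.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        seg, score = remaining.pop(best)
        kept.append((seg, score))
        # Decay the scores of the others according to their overlap with it.
        remaining = [
            (s, sc * np.exp(-temporal_iou(seg, s) ** 2 / sigma))
            for s, sc in remaining
        ]
    return kept
```

For instance, with proposals `(0, 10)`, `(1, 10)`, and `(20, 30)`, the second proposal overlaps the first heavily, so its score is decayed and it is ranked after the non-overlapping third proposal rather than being removed outright.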
Finally, we apply an existing classification model to our generated proposals to evaluate the detection performance of TPO. Result We validate the proposed model on two tasks: action proposal generation and action detection. Experimental results indicate that TPO outperforms other state-of-the-art methods on the ActivityNet v1.3 dataset. For proposal generation, we compare our model with SSN (structured segment network), TCN (temporal context network), Prop-SSAD (single shot action detector for proposal), CTAP (complementary temporal action proposal), and BSN (boundary sensitive network). The proposed TPO model performs best, achieving an average recall at an average number of 100 proposals (AR@100) of 74.66 and an area under the curve (AUC) of 66.32. For the temporal action detection task, we use the quantitative metric mean average precision at intersection over union (mAP@IoU). Compared with existing methods, including SCC (semantic cascade context), CDC (convolutional-de-convolutional), SSN, and BSN, TPO achieves the best mAPs of 30.73 and 8.22 at temporal IoU (tIoU) thresholds of 0.75 and 0.95, respectively, and obtains the best average mAP of 30.5. Notably, the mAP value decreases as the tIoU threshold increases. The tIoU metric reflects the overlap between the generated proposals and the ground truth, and a high tIoU threshold imposes strict constraints on candidate proposals. That TPO achieves the best mAP at the high tIoU thresholds (0.75 and 0.95) therefore indicates that it generates accurate proposals that overlap closely with the ground-truth action instances, which in turn improves detection performance. Conclusion In this paper, we propose a novel model called TPO for temporal proposal generation, which achieves promising performance on ActivityNet v1.3 for the action detection problem. Experimental results demonstrate the effectiveness of TPO. 
TPO generates temporal proposals with precise boundaries and maintains flexible temporal durations, thereby covering sequential actions in videos with variable-length intervals.
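The remark in the Result part that higher tIoU thresholds impose stricter constraints can be made concrete with a small sketch. The code below is a simplified illustration, not the official ActivityNet evaluation: the helper names and the toy segments are hypothetical, and plain recall is used in place of mAP or AR@100 to keep the example short.

```python
def temporal_iou(proposal, ground_truth):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(proposal[1], ground_truth[1])
                - max(proposal[0], ground_truth[0]))
    union = ((proposal[1] - proposal[0])
             + (ground_truth[1] - ground_truth[0]) - inter)
    return inter / union if union > 0 else 0.0


def recall_at_tiou(proposals, ground_truths, threshold):
    """Fraction of ground-truth segments matched by at least one proposal
    whose tIoU with that segment reaches `threshold`."""
    hits = sum(
        any(temporal_iou(p, g) >= threshold for p in proposals)
        for g in ground_truths
    )
    return hits / len(ground_truths)


# Toy data (hypothetical): two annotated actions and three proposals.
ground_truths = [(10.0, 30.0), (50.0, 60.0)]
proposals = [(12.0, 30.0), (48.0, 62.0), (0.0, 5.0)]

for threshold in (0.5, 0.75, 0.95):
    print(threshold, recall_at_tiou(proposals, ground_truths, threshold))
```

In this toy case the first proposal has a tIoU of 0.9 with its ground truth and the second about 0.71, so both actions are recalled at a threshold of 0.5, only one at 0.75, and none at 0.95. Only proposals with tight boundaries survive the stricter thresholds.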
Keywords
temporal action detection; temporal action proposals; actionness probability; connectionist temporal classification (CTC); convolutional neural network (CNN); bidirectional long short-term memory (BLSTM)