A nonlocal network model for group activity recognition

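Li Ding1,2, Ma Jing1, Yang Menglin2, Zhang Wensheng2 (1. School of Automation, Harbin University of Science and Technology, Harbin 150001; 2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190)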

Abstract
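Objective Video-based activity recognition has long attracted attention from computer vision researchers and mainly comprises individual action recognition and group activity recognition. Group activity recognition takes the actions of a group of people as its object of study and aims to represent and classify their activities effectively; it has important applications in intelligent surveillance, sports analysis, and video retrieval. Most existing algorithms build on multi-layer recurrent neural network (RNN) models and construct group activity features that characterize the relationship between individuals and the group they belong to, but they do not fully consider the mutual influence among individuals, which leads to low recognition accuracy. To address this, we propose a group activity recognition model based on a nonlocal convolutional neural network that makes full use of the contextual information among individuals and effectively improves group activity recognition accuracy. Method The proposed model uses a bottom-up approach to recognize individual actions and group activities hierarchically and simultaneously. First, image patches around each individual are extracted from the raw video along that person's motion trajectory. A nonlocal convolutional neural network (CNN) then extracts static features that encode the influence relationships among individuals. The extracted individual static features are fed into a multi-layer long short-term memory (LSTM) temporal model to obtain individual dynamic features, which are aggregated into group activity features. Finally, the individual and group activity features are used to recognize individual actions and group activities at the same time. Result Experiments are conducted on the internationally used Volleyball Dataset. The results show that the proposed model achieves 77.6% accuracy without fine-grained group division and 83.5% accuracy with fine-grained group division. Conclusion This work is the first to propose a nonlocal convolutional network for group activity recognition and builds a nonlocal group activity recognition model on it. By considering the mutual influence among individuals and incorporating individual contextual information, the model learns more discriminative group activity features from the training data. These features contain the contextual information among individuals and also preserve the hierarchical structure within the group, which benefits the final group activity classification.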
Keywords
Nonlocal based deep model for group activity recognition

Li Ding1,2, Ma Jing1, Yang Menglin2, Zhang Wensheng2 (1. School of Automation, Harbin University of Science and Technology, Harbin 150001, China; 2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)

Abstract
Objective Human action recognition, which comprises single-person action recognition and group activity recognition, has received considerable research attention. Group activity recognition builds on single-person action recognition and focuses on the group of people in the scene. It has various applications, including video surveillance, sports analytics, and video retrieval. In group activity recognition, the hierarchical structure between the group and its individuals is important, and the main challenge is to build more discriminative representations of group activity based on this structure. Researchers have proposed numerous methods to address this challenge. In the early years, hand-crafted features were designed to represent individual-level and group-level activities. Recently, deep learning has been widely used in group activity recognition. Typically, hierarchical frameworks based on recurrent neural networks (RNNs) have been adopted to represent the relationships between individuals and their corresponding group, and they have achieved promising performance. However, these methods ignore the relationships and interactions among individuals, which reduces recognition accuracy. A group activity is defined jointly by each individual action and the contextual information among individuals, so extracting individual features in isolation loses this contextual information. To address this problem, we propose a novel model for group activity recognition based on the nonlocal network. Method The proposed model uses a bottom-up approach to represent and recognize individual actions and group activities in a hierarchical manner.
First, tracklets of multiple persons are constructed from detections and trajectories, and static features are extracted from these tracklets by a nonlocal convolutional neural network (NCNN). Inside the NCNN module, the similarity between individuals is calculated to capture the nonlocal context among them. The extracted features are then fed into the hierarchical temporal model (HTM), which is based on long short-term memory (LSTM). The HTM is composed of an individual-level LSTM and a group-level LSTM and models group dynamics in a hierarchical manner. Dynamic features of individuals are extracted, and features of group activities are generated by aggregating individual features in the HTM. Finally, group activities and individual actions are classified using the output of the HTM. The entire framework is easily implemented and trained end to end. Result We evaluate our model on the widely used Volleyball Dataset in two settings, namely, fine-division and non-fine-division. The fine-division setting treats the group as a combination of subgroups, each composed of several individuals, so the structure of the group is "group-subgroup-individuals". In this setting, we aggregate the individual features within each subgroup and then concatenate the subgroup features. The non-fine-division setting involves no subgroups: we aggregate all individual features to generate the group features. Experimental results show that the proposed method achieves 83.5% accuracy in the fine-division setting and 77.6% accuracy in the non-fine-division setting. Examples of recognition results and of relationships within the group are visualized. Conclusion This study proposes a novel neural network for group activity recognition and constructs a unified framework based on the NCNN and a hierarchical LSTM network.
We take the relationships among individuals into consideration through a nonlocal network and exploit the contextual information in the group. When extracting individual features, the method learns more discriminative features that incorporate the influence of each individual. Thus, contextual information from nonlocal areas is embedded into the extracted features. Experimental results confirm the effectiveness of our nonlocal model, indicating that the contextual information between individuals and the hierarchical structure of the group facilitate group activity recognition.
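
The two mechanisms named in the abstract, the similarity-weighted nonlocal context among individuals and the fine-division versus non-fine-division aggregation, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the softmax-of-dot-products similarity, the residual connection, max pooling as the aggregation operator, the random projection matrices, the split into two subgroups of six, and all function names. The LSTM-based HTM is omitted, so pooling is applied directly to the context-enriched static features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def pairwise_similarity(feats, w_theta, w_phi):
    """Similarity between every pair of individuals.

    feats: (N, D) array, one static feature vector per person in a frame.
    Returns an (N, N) matrix whose rows sum to 1 (assumed softmax form).
    """
    theta = feats @ w_theta  # (N, K) query-like embedding
    phi = feats @ w_phi      # (N, K) key-like embedding
    return softmax(theta @ phi.T, axis=-1)


def nonlocal_context(feats, w_theta, w_phi, w_g):
    """Add to each person's feature a similarity-weighted sum over all persons."""
    sim = pairwise_similarity(feats, w_theta, w_phi)  # (N, N)
    values = feats @ w_g                              # (N, D)
    return feats + sim @ values                       # residual form (assumed)


def aggregate_group(person_feats, subgroups=None):
    """Aggregate person features into one group feature.

    subgroups=None: non-fine-division, pool over all individuals.
    subgroups=list of index lists: fine-division, pool within each
    subgroup and concatenate the subgroup features.
    Max pooling is an assumed choice of aggregation operator.
    """
    if subgroups is None:
        return person_feats.max(axis=0)
    pooled = [person_feats[idx].max(axis=0) for idx in subgroups]
    return np.concatenate(pooled)


# Toy input: 12 individuals with 8-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
n_persons, dim, embed = 12, 8, 4
feats = rng.normal(size=(n_persons, dim))
w_theta = rng.normal(size=(dim, embed))
w_phi = rng.normal(size=(dim, embed))
w_g = rng.normal(size=(dim, dim))

sim = pairwise_similarity(feats, w_theta, w_phi)
ctx = nonlocal_context(feats, w_theta, w_phi, w_g)

# Non-fine-division: one pooled vector of length dim.
group_flat = aggregate_group(ctx)
# Fine-division: two assumed subgroups of six, concatenated to length 2 * dim.
group_fine = aggregate_group(ctx, subgroups=[list(range(0, 6)), list(range(6, 12))])

print(sim.shape, ctx.shape, group_flat.shape, group_fine.shape)
```

In this sketch the fine-division feature keeps one pooled vector per subgroup, so it preserves which subgroup a response came from, whereas the non-fine-division feature merges all individuals into a single vector.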
Keywords
