Action recognition using ensembling of different distillation-trained spatial-temporal graph convolution models
Abstract
Objective Deep-learning-based action recognition methods have achieved significantly higher accuracy, but many challenges and difficulties remain. Current methods show poor robustness on datasets with large training sets and many action categories, as well as in practical applications; moreover, many of them rely on models with large parameter counts and complex computation. Improving model accuracy and robustness while making the model lightweight therefore remains an important research direction. To this end, we propose a lightweight spatial-temporal graph convolutional ensemble model for action recognition based on knowledge distillation. Method We improve on recent spatial-temporal convolutional networks, using grouped convolution and related techniques to design spatial-temporal convolutional sub-models with fewer parameters. To train these sub-models, two existing fully convolutional models are first trained on the dataset as teacher models; with the trained teachers in hand, knowledge distillation combined with data augmentation is then used to train the lightweight spatial-temporal convolutional sub-models. Finally, the distillation-trained sub-models are combined by linear fusion to obtain the final ensemble model. Result On the widely used NTU RGB+D dataset, we compared our model with multiple state-of-the-art methods. Under the cross-subject (CS) and cross-view (CV) benchmarks, our model achieves accuracies of 90.9% and 96.5%, respectively: 2.4% and 1.4% higher than the teacher model 2s-AGCN (two-stream adaptive graph convolutional networks for skeleton-based action recognition); 1.0% and 0.4% higher than the teacher model DGNN (directed graph neural network); and 0.9% and 0.3% higher than the MS-AAGCN (multi-stream attention-enhanced adaptive graph convolutional neural network) model. Conclusion The proposed ensemble model combines the strengths of knowledge distillation, data augmentation, and model fusion, making action recognition more accurate and robust.
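The parameter savings from the grouped (depth-wise separable) temporal convolution mentioned above can be illustrated with a simple count. This is a generic sketch of the standard depth-wise-separable accounting, not the paper's exact configuration; the channel width `C = 64` is an illustrative assumption.

```python
def standard_conv_params(c_in, c_out, kh, kw):
    """Weight count of a standard 2D convolution (bias omitted)."""
    return c_in * c_out * kh * kw

def depthwise_separable_params(c_in, c_out, kh, kw):
    """Depth-wise conv (one kh*kw filter per input channel)
    followed by a 1x1 point-wise conv that mixes channels."""
    return c_in * kh * kw + c_in * c_out * 1 * 1

C = 64  # illustrative channel width, not from the paper
standard = standard_conv_params(C, C, 9, 1)         # 64*64*9   = 36864
separable = depthwise_separable_params(C, C, 9, 1)  # 64*9+64*64 = 4672

print(standard, separable, round(standard / separable, 1))
```

At this width, the 9×1 depth-wise plus 1×1 point-wise design uses roughly one-eighth of the weights of a standard 9×1 convolution while keeping the same temporal receptive field.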
Keywords
Action recognition using ensembling of different distillation-trained spatial-temporal graph convolution models
Yang Qingshan, Mu Taijiang (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Abstract
Objective Skeleton-based action recognition, an intensively studied field, aims to classify human actions, represented by sequences of selected key points of the human body, into action categories. It has a wide variety of applications, including human-computer interaction, elderly monitoring, and video understanding. Recognition accuracy has improved significantly in recent years with the development of deep learning. However, few studies have focused on the number of parameters and the robustness of the model. Previous skeleton-based methods use convolutions with large kernels to extract spatial and temporal features for a broad receptive field, which increases model parameters and computational cost. Many studies have confirmed that graph convolution performs better on skeletal data; however, graph convolution operators are designed manually, and their versatility and robustness are limited. We therefore design a lightweight temporal convolution module that preserves the large receptive field for temporal feature learning. In the spatial dimension, we construct a spatial convolution module from two kinds of graph convolutions for better robustness. We further apply data augmentation to increase the diversity of the input data and improve the generalization ability of the model across viewpoints. To this end, a distillation training method that improves the accuracy of lightweight models is used for model training, and a multi-stream spatiotemporal graph convolutional ensemble model is constructed to improve on current methods and increase the accuracy of skeleton-based action recognition. Method In this study, we propose a skeleton-based multi-stream ensemble model composed of six sub-networks for action recognition.
These sub-networks are of two types: the directed graph convolutional sub-network (DGCNNet) and the adaptive graph convolutional sub-network (AGCNNet). Each sub-network is built from temporal convolution modules, spatial convolution modules, and attention modules. The temporal convolution module consists of a 2D depth-wise group convolution layer with a 9×1 kernel and a standard convolution layer with a 1×1 kernel. Two types of graph convolution, directed graph convolution and adaptive graph convolution, are used in the spatial convolution module to extract spatial features and enhance the robustness of the model. Three self-attention modules between the spatial and temporal convolution modules are applied over the channel, spatial, and temporal dimensions of the features to emphasize informative features. We also introduce a cross-modal distillation training method that uses trained teacher models together with the ground truth to train lighter and more accurate student sub-networks. Distillation training consists of two steps: teacher model training and student model training. The teacher model is trained on the training data, and its weights are frozen once training is complete. The student model is then trained with the feature vectors encoded by the frozen teacher model and the ground-truth labels of the training data. Two previous methods, 2s-AGCN (two-stream adaptive graph convolutional networks for skeleton-based action recognition) and directed graph neural networks (DGNN), are used as teacher models to train the AGCNNet and DGCNNet sub-networks, respectively, according to the type of graph convolution in the spatial convolution module. Cross-entropy loss is used for teacher model training, and a combination of mean squared error loss and cross-entropy loss is used as the final loss for student model training.
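The combined student loss described above can be sketched as follows. This is a minimal pure-Python illustration under assumed inputs: the weighting factor `alpha` and the toy logits/feature vectors are hypothetical, since the abstract does not give the actual term weights.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of the student's prediction against the ground-truth class index."""
    return -math.log(softmax(logits)[label])

def mse(student_vec, teacher_vec):
    """Mean squared error between student features and frozen-teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)

def distillation_loss(student_logits, label, student_feat, teacher_feat, alpha=0.5):
    # alpha balances the distillation term and the supervised term;
    # the actual weighting used in the paper is not stated in the abstract.
    return (alpha * mse(student_feat, teacher_feat)
            + (1 - alpha) * cross_entropy(student_logits, label))

loss = distillation_loss([2.0, 0.5, 0.1], label=0,
                         student_feat=[1.0, 2.0], teacher_feat=[1.0, 4.0])
```

The student thus receives two gradients: one pulling its features toward the teacher's encoding, and one pulling its class prediction toward the label.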
A student model trained with this distillation method is not only lighter but also more accurate than its teacher. In addition to joints, we also take bones and affine-transformation-augmented data as input to train the student models. Finally, a multi-stream spatiotemporal graph convolutional ensemble model is constructed from the six lightweight sub-networks, yielding better robustness and higher accuracy. The accuracies of our model on the cross-subject and cross-view benchmarks of the NTU RGB+D dataset are 90.9% and 96.5%, respectively, higher than those of many current state-of-the-art approaches. Result We compared our model with 14 of the best models to date on the widely used NTU RGB+D dataset. Our model achieves 90.9% cross-subject accuracy and 96.5% cross-view accuracy. Compared with the teacher model 2s-AGCN, the accuracies increase by 2.4% and 1.4%; compared with the other teacher model, DGNN, they increase by 1.0% and 0.4%; and compared with the baseline method, spatial temporal graph convolutional networks (ST-GCN), they are 9.4% and 8.2% higher, respectively. In addition, extensive experiments demonstrate the effectiveness of knowledge distillation on this task, and we explore the effects of different combinations of input modalities on the final accuracy of the model. Conclusion In this article, we propose a new multi-stream ensemble model that contains six sub-models trained using the distillation training method, each built from spatial convolution modules, temporal convolution modules, and attention modules. The experimental results indicate that our model outperforms several state-of-the-art skeleton-based action recognition approaches and that the ensemble algorithm improves performance.
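The linear fusion of the sub-network outputs can be sketched as a weighted sum of per-class score vectors. Equal stream weights and the three-stream, three-class toy scores below are illustrative assumptions; the paper's actual weights are not given in the abstract.

```python
def linear_fusion(stream_scores, weights=None):
    """Linearly fuse per-class score vectors from the sub-networks.

    stream_scores: one score vector (length = num classes) per stream.
    weights: per-stream fusion weights; defaults to equal weighting.
    """
    n = len(stream_scores)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(stream_scores[0])
    return [sum(w * scores[c] for w, scores in zip(weights, stream_scores))
            for c in range(num_classes)]

# Toy softmax scores from three of the streams over three action classes.
streams = [[0.1, 0.7, 0.2],
           [0.2, 0.6, 0.2],
           [0.5, 0.3, 0.2]]
fused = linear_fusion(streams)
predicted = fused.index(max(fused))  # class 1 wins after fusion
```

Even though the third stream alone favors class 0, the fused scores follow the majority evidence, which is the robustness benefit the ensemble is designed to provide.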
Keywords