
A review of skeleton-based human action recognition

Lu Jian, Li Xuanfeng, Zhao Bo, Zhou Jian (School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710600, China)

Abstract

Skeleton-based human action recognition aims to correctly identify the action classes in skeleton sequences that contain one or more actions, and it has recently become an active research topic in computer vision. Because actions are used to carry out tasks and express human emotions, action recognition can be widely applied in fields such as intelligent monitoring systems, human-computer interaction, virtual reality, and smart healthcare. Compared with RGB-based human action recognition, skeleton-based methods are less affected by interference factors such as background and human appearance, and have higher accuracy and robustness. In addition, these methods require a small amount of data and show high computational efficiency, which improves their prospects in practical applications. A comprehensive and systematic summary and analysis of skeleton-based human action recognition methods is therefore critical. Compared with other reviews on skeleton-based action recognition, our contributions are as follows: we provide a more comprehensive summary of skeleton-based action datasets; we provide a more comprehensive summary of skeleton-based action recognition methods, including the latest Transformer technology; we offer a more instructive classification of graph convolutional methods; and we not only summarize the existing problems but also forecast the prospects for future research. First, we introduce nine datasets that are commonly used for skeleton-based action recognition: MSR Action3D, MSR Daily Activity 3D, 3D Action Pairs, SYSU 3DHOI, UTD-MHAD, Northwestern-UCLA, NTU RGB+D 60, Skeleton-Kinetics, and NTU RGB+D 120.
To highlight the characteristics of these datasets, we divide them into single-view and multi-view datasets according to how the data were collected, and then explore the traits and uses of each category. Second, based on the backbone network used by the models, we categorize skeleton-based action recognition methods into those based on handcrafted features, recurrent neural networks (RNNs), convolutional neural networks (CNNs), graph convolutional networks (GCNs), and Transformers. Before the rise of deep learning, traditional algorithms (handcrafted features) were often used to model human skeleton data. The key problem in such methods is how to create an effective feature representation of human skeleton sequences. After deep learning methods demonstrated excellent performance in fields such as face recognition, image classification, and image super-resolution, researchers began using deep networks to model skeleton data. Among them, RNNs effectively process data in the form of continuous time series and are adept at learning temporal dependencies in sequence data, while CNNs can effectively learn high-level semantic information from skeleton data. Training a CNN-based model requires lower computational cost than training an RNN-based one. Unlike RNN-based methods, CNN-based methods require the skeleton data to be reshaped into pseudo-images first. The columns of the pseudo-image represent the features of all joints in one frame, while the rows represent the features of a certain joint across all frames. However, when RNN or CNN methods are used to model skeleton data, the topological structure of the human skeleton is ignored. Transforming the skeleton data into sequence vectors of joint coordinates or a 2D grid cannot accurately describe the dynamic skeleton of the human body.
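
The pseudo-image layout described above (rows index joints, columns index frames) can be illustrated with a minimal NumPy sketch. The function name `skeleton_to_pseudo_image`, the input layout `(frames, joints, channels)`, and the per-channel min-max scaling to [0, 255] are assumptions made for illustration; individual CNN-based methods may arrange and normalize the data differently.

```python
import numpy as np


def skeleton_to_pseudo_image(sequence: np.ndarray) -> np.ndarray:
    """Arrange a skeleton sequence as a pseudo-image.

    sequence: array of shape (frames, joints, channels), e.g. channels = (x, y, z).
    Returns an array of shape (joints, frames, channels), so that each column
    holds all joints of one frame and each row follows one joint over time.
    """
    if sequence.ndim != 3:
        raise ValueError("expected an array of shape (frames, joints, channels)")

    # Swap the frame and joint axes: rows = joints, columns = frames.
    image = np.transpose(sequence, (1, 0, 2)).astype(np.float64)

    # Assumption: scale each coordinate channel to [0, 255] so the result
    # resembles an ordinary image; constant channels are mapped to 0.
    lo = image.min(axis=(0, 1), keepdims=True)
    hi = image.max(axis=(0, 1), keepdims=True)
    scale = np.where(hi > lo, hi - lo, 1.0)
    return 255.0 * (image - lo) / scale
```

For example, a clip with 2 frames, 3 joints, and 3D coordinates yields a pseudo-image of shape `(3, 2, 3)` that can be fed to a standard 2D CNN like any small three-channel image.
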
Previous studies show that graph convolution has a powerful ability to model topological graph structures, which makes it particularly suitable for modeling the human skeleton. Given this success, graph convolutional methods have been widely used in skeleton-based action recognition. This paper adopts a novel inductive approach and provides a comprehensive review of GCN-based methods, classifying them according to the problems they target in order to provide researchers with additional ideas and methods. These studies can be divided into four groups: optimization of the graph structure, network lightweighting, optimization of temporal and spatial features, and optimization for missing and noisy joints. This paper also summarizes the issues faced by currently available methods, pointing out their limitations and challenges, assessing future development trends, and offering prospects for the field. In doing so, this review helps readers gain a deep understanding of the current state of this task and provides guidance for future research in this area.
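
To make concrete how graph convolution uses the skeleton topology, the following NumPy sketch applies one spatial graph convolution to a skeleton sequence. It uses the common symmetric normalization of the adjacency matrix with self-loops; this is a generic formulation and not the method of any specific work reviewed here. The 5-joint skeleton in `EDGES` and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 5-joint skeleton used only for illustration:
# 0 = torso, 1 = head, 2 = left hand, 3 = right hand, 4 = feet.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4)]


def normalized_adjacency(num_joints, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = np.eye(num_joints)  # self-loops keep each joint's own features
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    deg = a.sum(axis=1)  # every degree is >= 1 because of the self-loops
    d_inv_sqrt = np.diag(deg ** -0.5)
    return d_inv_sqrt @ a @ d_inv_sqrt


def graph_conv(x, a_hat, weight):
    """One spatial graph convolution followed by ReLU.

    x:      (frames, joints, in_channels) joint features
    a_hat:  (joints, joints) normalized adjacency
    weight: (in_channels, out_channels) projection (learned in a real model)
    """
    # Each joint aggregates features from itself and its bone-connected neighbors.
    aggregated = np.einsum("vu,tuc->tvc", a_hat, x)
    return np.maximum(aggregated @ weight, 0.0)
```

Unlike the sequence-vector or 2D-grid representations, the aggregation step here mixes features only between joints that are connected by a bone, which is how the skeleton's topology enters the model.
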
