Spatio-temporal dual affine differential invariants and skeleton-based action recognition

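Li Qi1,2, Mo Hanlin1,2, Zhao Jinghan1,2, Hao Hongxiang1,2, Li Hua1,2 (1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049)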

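Abstract
Objective The dynamics of the human skeleton are important for action recognition. From the perspective of joint trajectories, the trajectories of the joints that are informative for determining the action class convey the most important information. Across attempts at the same action, the trajectories of corresponding joints generally share a similar basic shape, but their specific form is subject to some distortion. Based on an analysis of the distortion factors, the common transformations of joint trajectories in human motion are modeled as a spatio-temporal dual affine transformation. Method The spatio-temporal dual affine transformation is first described by a unified expression in the form of inner and outer transformations. Dual affine differential invariants are then derived from the differential relationship between the trajectory curves before and after the transformation, and are used to describe local properties of joint trajectories. Because the differential invariants and the joint coordinates are isomorphic in data structure, a channel augmentation method is proposed: the input data are extended along the channel dimension with the differential invariants and then fed to the neural network for training and evaluation, in order to improve the generalization ability of the network. Result Experiments on two large-scale action recognition datasets, NTU (Nanyang Technological University) RGB+D (NTU 60) and NTU RGB+D 120 (NTU 120), compare the method with several state-of-the-art methods and two baseline methods, and clear improvements are obtained under both experimental settings (cross-subject and cross-view recognition). Compared with the spatio-temporal graph convolutional network (ST-GCN) using raw data, the cross-subject and cross-view recognition accuracy on NTU 60 increases by 1.9% and 3.0%, respectively; on NTU 120, the cross-subject and cross-setup recognition accuracy increases by 5.6% and 4.5%, respectively. In addition, compared with data augmentation, channel augmentation based on invariant features yields clear improvements under both experimental settings and improves the generalization ability of the network more effectively. Conclusion The proposed invariant features and channel augmentation combine the advantages of traditional features and deep learning in an intuitive and effective way, improving the accuracy of skeleton-based action recognition and the generalization ability of neural networks.
Keywords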
Spatio-temporal dual affine differential invariants for skeleton-based action recognition

Li Qi1,2, Mo Hanlin1,2, Zhao Jinghan1,2, Hao Hongxiang1,2, Li Hua1,2(1.Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;2.University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract
Objective Skeleton-based action recognition has attracted considerable attention in recent years because the dynamics of the human skeleton carry significant information for the task. A skeleton action can be viewed as a time series of human poses or as a combination of joint trajectories. Among all joints, the trajectories of the joints that indicate the action class convey the most significant information. When the same action is performed in different attempts, these trajectories are subject to distortion: two similar trajectories of corresponding joints share a basic shape but appear in different distorted forms because of individual factors. The distortions arise from spatial and temporal factors. Spatial factors include changes of viewpoint, different skeleton sizes, and different action amplitudes; temporal factors correspond to scaling along the time axis, reflecting the order and speed with which an action is performed. All the spatial factors can be modeled by an affine transformation in 3D space, while uniform time scaling, the commonly discussed case, can be seen as an affine transformation in 1D space. We combine these two kinds of distortion into the spatio-temporal dual affine transformation and propose a novel feature that is invariant under it. Such an invariant feature helps identify similar trajectories and thus benefits skeleton-based action recognition. Method We propose a general method for constructing spatio-temporal dual affine differential invariants (STDADIs). Specifically, the invariants are obtained as rational polynomials of the derivatives of joint trajectories, constructed so that the transformation parameters are eliminated.
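The abstract does not state the transformation or the invariants explicitly, so the following is only a sketch under assumed notation: r(t) is a 3D joint trajectory, A and b are the spatial affine parameters, and alpha and beta are the temporal ones. It shows how a rational polynomial of derivatives can cancel all four parameters. The quantity I is an illustrative invariant of this type and is not claimed to be one of the eight STDADIs of the paper.

```latex
% Assumed notation (not taken from the paper):
%   r(t) in R^3 : joint trajectory,  r^{(n)} : its n-th derivative
%   D_{ijk}     : determinant of three derivative vectors
\begin{align}
\tilde{t} &= \alpha t + \beta, \qquad \alpha \neq 0, \\
\tilde{r}(\tilde{t}) &= A\, r(t) + b, \qquad A \in \mathbb{R}^{3\times 3},\ \det A \neq 0, \\
\frac{d^{n}\tilde{r}}{d\tilde{t}^{\,n}} &= \alpha^{-n} A\, r^{(n)}(t), \qquad n \ge 1, \\
D_{ijk} &= \det\!\left[\, r^{(i)},\ r^{(j)},\ r^{(k)} \,\right], \qquad
\tilde{D}_{ijk} = \det(A)\, \alpha^{-(i+j+k)}\, D_{ijk}, \\
I &= \frac{D_{124}^{2}}{D_{123}\, D_{134}}
\quad\Longrightarrow\quad
\tilde{I} = \frac{\det(A)^{2}\, \alpha^{-14}\, D_{124}^{2}}
                 {\det(A)^{2}\, \alpha^{-14}\, D_{123}\, D_{134}} = I .
\end{align}
```

The translation b and the time shift beta disappear under differentiation, while det(A) and alpha cancel because the numerator and the denominator contain the same number of determinants and the same total derivative order (14).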
The resulting features are robust, independent of the coordinate system, and calculated directly from the 3D coordinates. By bounding the degree of the polynomials and the order of the derivatives, we generate 8 independent STDADIs and combine them into an invariant vector at each moment for each human joint. Moreover, we propose an intuitive and effective method called channel augmentation, which extends the input data with STDADIs along the channel dimension for training and evaluation. Specifically, the coordinate vector and the STDADI vector of each joint in each frame are concatenated. Channel augmentation introduces invariant information into the input data without changing the inner structure of the neural network. We use the spatio-temporal graph convolutional network (ST-GCN) as the basic network. ST-GCN models skeleton data as a graph that involves spatial and temporal connections between human joints simultaneously, and it exploits local patterns and correlations in human skeletons. In other words, the importance of joints along the action sequence is expressed as the weights of human joints in the spatio-temporal graph. This is in line with STDADI: both focus on describing joint dynamics, and our features further provide an invariant expression that is not affected by the distortions. Result We verify the effectiveness of STDADI on synthetic data as well as on large-scale action recognition datasets. First, experiments on a 3D spiral line and on selected joint trajectories from NTU RGB+D, transformed with random parameters, show that STDADI is invariant under spatio-temporal dual affine transformations.
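A synthetic check of this kind can be sketched as follows. Everything here is an assumption made for illustration: the curve is a generic sinusoidal curve rather than the spiral used in the paper, and the tested quantity D_124^2 / (D_123 * D_134), where D_ijk is the determinant of the i-th, j-th, and k-th derivatives of the trajectory, is an illustrative invariant, not necessarily one of the eight STDADIs. The transformed curve is again a sinusoidal curve with a new frequency and phase, so its derivatives are evaluated by the same closed-form routine as the original.

```python
import numpy as np

def sinusoid_derivative(freq, phase, t, order):
    """n-th derivative of sin(freq * t + phase), evaluated per coordinate."""
    return freq**order * np.sin(freq * t + phase + order * np.pi / 2)

def derivative_det(A, freq, phase, t, orders):
    """Determinant of three derivative vectors of r(t) = A @ sin(freq*t + phase) + b.

    The translation b does not appear because it vanishes under differentiation.
    """
    columns = [A @ sinusoid_derivative(freq, phase, t, n) for n in orders]
    return np.linalg.det(np.stack(columns, axis=1))

def illustrative_invariant(A, freq, phase, t):
    """D_124^2 / (D_123 * D_134): an illustrative (hypothetical) invariant."""
    d123 = derivative_det(A, freq, phase, t, (1, 2, 3))
    d124 = derivative_det(A, freq, phase, t, (1, 2, 4))
    d134 = derivative_det(A, freq, phase, t, (1, 3, 4))
    return d124**2 / (d123 * d134)

rng = np.random.default_rng(0)

# Original curve r(t) = sin(freq * t + phase), one sinusoid per coordinate.
freq = np.array([1.0, 2.0, 3.0])
phase = np.array([0.3, 1.1, 2.0])
t = 0.7

# Random transformation parameters (assumed ranges, chosen only for the demo).
A = rng.normal(size=(3, 3)) + 2.0 * np.eye(3)   # spatial affine matrix
alpha = rng.uniform(0.5, 2.0)                   # time scaling
beta = rng.uniform(-1.0, 1.0)                   # time shift

original = illustrative_invariant(np.eye(3), freq, phase, t)

# With tau = alpha * t + beta, the transformed curve is
# A @ sin(freq * (tau - beta) / alpha + phase) + b,
# i.e. sinusoids with a rescaled frequency and a shifted phase.
new_freq = freq / alpha
new_phase = phase - freq * beta / alpha
tau = alpha * t + beta
transformed = illustrative_invariant(A, new_freq, new_phase, tau)

print(original, transformed)
```

The two printed values should agree up to floating-point error, whereas a single determinant such as D_123 would change by the factor det(A) / alpha^6.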
Next, we validate the proposed feature and method on the large-scale action recognition dataset NTU (Nanyang Technological University) RGB+D (NTU 60) and its extended version NTU RGB+D 120 (NTU 120), which is currently the largest dataset with 3D joint annotations captured in a constrained indoor environment, and we perform a detailed study to examine the contribution of STDADI. The original ST-GCN and a data augmentation technique serve as the baseline methods. The data augmentation technique involves rotation, scaling, and shear transformations of 3D skeletons. We use the same training strategy and hyper-parameters as the original ST-GCN. ST-GCN with channel augmentation improves on ST-GCN using raw data under every setting: on NTU 60, the cross-subject and cross-view recognition accuracy increases by 1.9% and 3.0%, respectively; on NTU 120, the cross-subject and cross-setup recognition accuracy increases by 5.6% and 4.5%, respectively. Because data augmentation consists mainly of 3D geometric transformations, it improves accuracy considerably in cross-view recognition but contributes little in the cross-subject setting, whereas channel augmentation yields clear improvements under both settings. The spatio-temporal dual affine transformation assumption is thus validated under both evaluation criteria. Conclusion We propose a general method for constructing spatio-temporal dual affine differential invariants (STDADIs). The effectiveness of this invariant feature, used with a channel augmentation technique, is demonstrated on the large-scale action recognition datasets NTU RGB+D and NTU RGB+D 120. Combining hand-crafted features with data-driven methods improves both accuracy and generalization.
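Channel augmentation as described in the abstract amounts to concatenating, for each joint in each frame, the 3D coordinate vector with the 8-dimensional STDADI vector. A minimal sketch, assuming a (channels, frames, joints) array layout, 25 joints, and 300 frames per sequence; the function name `channel_augment` is hypothetical and the invariant values are random placeholders, since the abstract does not give the STDADI formulas.

```python
import numpy as np

def channel_augment(coords, invariants):
    """Concatenate coordinates (C, T, V) and invariants (K, T, V) along the channel axis."""
    if coords.shape[1:] != invariants.shape[1:]:
        raise ValueError("coords and invariants must share frame and joint dimensions")
    return np.concatenate([coords, invariants], axis=0)

# Assumed sizes for the demo: 300 frames, 25 joints, 3 coordinates, 8 invariants.
num_frames, num_joints = 300, 25
rng = np.random.default_rng(0)
coords = rng.normal(size=(3, num_frames, num_joints))
invariants = rng.normal(size=(8, num_frames, num_joints))  # placeholder values only

augmented = channel_augment(coords, invariants)
print(augmented.shape)  # (11, 300, 25)
```

Only the number of input channels of the first network layer changes (from 3 to 11 in this sketch), which is consistent with the statement that the inner structure of the network is left untouched.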
Keywords
