Multi-dimensional feature fusion attention mechanism for skeleton-based action recognition

Jiang Quanyan, Wu Xiaojun, Xu Tianyang (School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)

Abstract
Objective In action recognition, properly exploiting spatio-temporal modeling and channel-wise correlations is crucial for capturing rich motion information. Although graph convolutional networks (GCNs) have made steady progress in skeleton-based action recognition, applying previous attention mechanisms to GCNs has not yielded obvious improvements in classification performance. Motivated by the importance of jointly considering spatio-temporal interactions and channel dependencies, this paper proposes a multi-dimensional feature fusion attention mechanism (M2FA). Method Unlike the design philosophy of widely used action recognition frameworks such as the convolutional block attention module (CBAM) and the two-stream adaptive graph convolutional network (2s-AGCN), M2FA explicitly captures comprehensive dependency information through a feature fusion module embedded in the attention framework. Given a feature map, M2FA applies global average pooling along the spatial, temporal, and channel dimensions to infer the corresponding feature descriptors. The feature map is then filtered by the fusion of these multi-dimensional descriptors for adaptive feature refinement, and multi-scale dynamic information is obtained by interleaving a global feature branch that compresses global dynamics with a local feature branch that uses only pointwise convolution layers. Result Experiments are conducted on the skeleton-based action recognition datasets NTU-RGBD and Kinetics-Skeleton, comparing the recognition accuracy of M2FA with the baseline 2s-AGCN and recently proposed graph convolutional models. On the Kinetics-Skeleton validation set, M2FA improves classification accuracy over the baseline 2s-AGCN by 1.8%; on the two benchmarks of NTU-RGBD, M2FA outperforms 2s-AGCN by 1.6% and 1.0%, respectively. Ablation studies further verify the effectiveness of the multi-dimensional feature fusion mechanism. The experimental results show that the proposed M2FA improves the classification performance of GCN-based skeleton action recognition methods. Conclusion Compared with the baseline 2s-AGCN and current mainstream graph convolutional models, the multi-dimensional feature fusion attention mechanism achieves the highest recognition accuracy, and it can be integrated into skeleton-based architectures for end-to-end training, producing more accurate classification results.
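To make the Method description concrete, the following is a minimal PyTorch-style sketch of such a module as it can be reconstructed from this abstract alone; the class name M2FASketch, the reduction ratio, and the fusion of descriptors by broadcast addition are assumptions, not the authors' released implementation.

```python
# A minimal sketch of an M2FA-like module, reconstructed only from the abstract's
# description; branch layouts, reduction ratio, and the fusion rule are assumptions.
import torch
import torch.nn as nn


class M2FASketch(nn.Module):
    """Multi-dimensional feature fusion attention over skeleton features of shape (N, C, T, V)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        inner = max(channels // reduction, 1)
        # Global branch: operates on the globally compressed dynamics.
        self.global_branch = nn.Sequential(
            nn.Conv2d(channels, inner, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(inner, channels, kernel_size=1),
        )
        # Local branch: pointwise (1x1) convolutions only, keeping per-frame, per-joint detail.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, inner, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(inner, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Feature descriptors from global average pooling along each dimension.
        chan_desc = x.mean(dim=(2, 3), keepdim=True)  # (N, C, 1, 1): channel descriptor
        temp_desc = x.mean(dim=3, keepdim=True)       # (N, C, T, 1): temporal descriptor
        spat_desc = x.mean(dim=2, keepdim=True)       # (N, C, 1, V): spatial descriptor
        # Fuse the descriptors by broadcasting them back to the full (N, C, T, V) shape.
        fused = chan_desc + temp_desc + spat_desc
        # Interleave global (compressed) and local (pointwise) information into one attention map.
        attn = self.sigmoid(self.global_branch(chan_desc) + self.local_branch(fused))
        # Filter the input feature map for adaptive feature refinement.
        return x * attn
```

In this reading, the global branch only sees the channel-pooled summary while the local branch operates on the full broadcast fusion, which is one plausible interpretation of interleaving compressed global dynamics with pointwise local detail.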
Keywords
M2FA: multi-dimensional feature fusion attention mechanism for skeleton-based action recognition

Jiang Quanyan, Wu Xiaojun, Xu Tianyang(School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)

Abstract
Objective Action analysis and recognition underpin a number of applications such as video surveillance, personal assistance, human-machine interaction, and sports video analysis. Among video-based action recognition methods, skeleton-based approaches have recently attracted increasing attention because of their robustness in complex scenarios. Skeleton data, i.e., the 2D or 3D spatial coordinates of body joints, are mainly obtained via depth sensors or video-based pose estimation algorithms. Graph convolutional networks (GCNs) have been developed because traditional methods, which ignore the graph structure of skeleton data, cannot capture the complete dependencies among joints. A key challenge is determining an adaptive graph structure for the skeleton data at the convolutional layers. The spatio-temporal graph convolutional network (ST-GCN) learns spatial and temporal features simultaneously by adding temporal edges between corresponding joints of the spatial graph in consecutive frames. However, ST-GCN focuses on the physical connections between human body joints in the spatial graph and ignores the internal dependencies in motion. Spatio-temporal modeling and channel-wise dependencies are crucial for capturing motion information in action recognition. Despite the credibility of GCNs in skeleton-based action recognition, applying classical attention mechanisms to them has yielded only limited improvement. Our research highlights the importance of both spatio-temporal interactions and channel-wise dependencies and proposes a novel multi-dimensional feature fusion attention mechanism (M2FA). Method Unlike action recognition models that rely on additional information flows or complicated stacking of multiple existing attention modules, our model explicitly captures comprehensive dependency information through a feature fusion module embedded in the attention framework. Given intermediate feature maps, M2FA infers feature descriptors along the spatial, temporal, and channel dimensions. The fusion of these descriptors then filters the input feature maps for adaptive feature refinement. As a lightweight and general module, M2FA can be seamlessly integrated into any skeleton-based architecture and trained end to end alongside the core recognition method. Result To verify its effectiveness, the proposed method is validated and analyzed on two large-scale skeleton-based action recognition datasets: NTU-RGBD and Kinetics-Skeleton. Ablation studies are carried out on both datasets to demonstrate the advantages of multi-dimensional feature fusion, and the analyses confirm the merit of M2FA for skeleton-based action recognition. On the Kinetics-Skeleton dataset, the action recognition accuracy of the proposed method is 1.8% higher than that of the baseline (2s-AGCN). On the cross-view benchmark of NTU-RGBD, the recognition accuracy of the proposed method is 96.1%, higher than that of the baseline; on the cross-subject benchmark it reaches 90.1%. These results show that the accuracy of the skeleton-based action recognition model 2s-AGCN can be significantly improved by incorporating the proposed adaptive attention mechanism. Conclusion We develop a novel multi-dimensional feature fusion attention mechanism (M2FA) that captures spatio-temporal interactions and channel-wise dependencies simultaneously. Experimental results show consistent improvements in classification accuracy and demonstrate the superiority of M2FA.
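As a usage illustration of the claim that the module can be seamlessly integrated into a skeleton-based architecture and trained end to end, the sketch below attaches the M2FASketch class from the earlier sketch to a simplified spatio-temporal block; the block layout and the stand-in "graph" convolution are hypothetical and do not reproduce the actual 2s-AGCN unit.

```python
# Hypothetical integration sketch: an M2FASketch module (defined in the earlier sketch)
# appended to a simplified spatio-temporal block; layer choices and sizes are assumptions.
import torch
import torch.nn as nn


class STBlockWithM2FA(nn.Module):
    def __init__(self, in_channels, out_channels, temporal_kernel=9):
        super().__init__()
        # Stand-in for a spatial graph convolution: a pointwise conv over (C, T, V) features.
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Temporal convolution along the frame axis.
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(temporal_kernel // 2, 0))
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.attention = M2FASketch(out_channels)  # attention module from the sketch above

    def forward(self, x):                           # x: (N, C, T, V)
        x = self.relu(self.bn(self.temporal(self.spatial(x))))
        return self.attention(x)                    # refined features, trainable end to end


if __name__ == "__main__":
    block = STBlockWithM2FA(in_channels=3, out_channels=64)
    joints = torch.randn(2, 3, 50, 25)              # 2 clips, 3D coordinates, 50 frames, 25 joints
    print(block(joints).shape)                      # torch.Size([2, 64, 50, 25])
```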
Keywords
