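Multimodal zero-shot human action recognition
Abstract
Objective In research on human action recognition algorithms, a growing number of studies perform zero-shot recognition from video features. However, most current studies rely on single-modality data, and few address multimodal fusion. To study the influence of multiple data modalities on zero-shot human action recognition, this paper proposes a zero-shot human action recognition framework based on multimodal fusion (ZSAR-MF). Method The framework consists mainly of a sensor feature-extraction module, a classification module, and a video feature-extraction module. Specifically, the sensor feature-extraction module uses a convolutional neural network (CNN) to extract heart rate and acceleration features; the classification module uses the word vectors of all concepts (sensor features, action names, and object names) to generate action category classifiers; the video feature-extraction module maps the attributes, object scores, and sensor features of each action into the attribute-feature space, and the classifiers generated by the classification module are finally used to evaluate the attributes and sensor features of each action. Result Experiments were conducted on the Stanford-ECM dataset. The comparison shows that the ZSAR-MF model improves recognition accuracy by approximately 4% over zero-shot recognition models based on single-modality data. Conclusion The proposed multimodal-fusion framework for zero-shot human action recognition effectively fuses sensor features and video features and significantly improves the accuracy of zero-shot human action recognition.
Keywords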
Multimodal-based zero-shot human action recognition
Lyu Lulu1, Huang Yi2, Gao Junyu2, Yang Xiaoshan2, Xu Changsheng2
(1. Zhengzhou University, Zhengzhou 450000, China; 2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
Abstract
Objective Human action recognition is a research hotspot in computer vision because of its wide application in human-computer interaction, virtual reality, and video surveillance. With the development of related technology in recent years, human action recognition algorithms based on deep learning have achieved good recognition performance when the sample size is sufficient. However, human action recognition remains difficult when samples are scarce or missing. Zero-shot recognition technology addresses this problem and has attracted considerable attention because it can directly classify “unseen” categories that are not in the training set. In the past decade, numerous methods have been proposed to perform zero-shot human action recognition by using video features, and they have achieved promising improvements. However, most current methods are based on single-modality data, and few studies have been conducted on multimodal fusion. To study the influence of multimodal fusion on zero-shot human action recognition, this study proposes a zero-shot human action recognition framework based on multimodal fusion (ZSAR-MF). Method Unlike most previous methods, which either fuse external information with video features or study single-modality video features only, our study focuses on sensor features, which are the most closely related to the activity state, to improve recognition performance. The framework is mainly composed of a sensor feature-extraction module, a classification module, and a video feature-extraction module. Specifically, the sensor feature-extraction module uses a convolutional neural network (CNN) to extract the acceleration and heart rate features of human actions and predict the most relevant feature words for each action.
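As an illustration of the sensor feature-extraction step described above, the following minimal sketch applies a one-dimensional convolution, a ReLU, and global max pooling to a short multi-channel sensor window. It is not the network used in this study: the function names (`conv1d`, `sensor_features`), the two-channel layout (heart rate and acceleration magnitude), the kernel size, and all numeric values are assumptions chosen for illustration only.

```python
# Minimal sketch of a 1D convolutional feature extractor for sensor windows.
# All names, shapes, and values are hypothetical, not taken from the paper.

def conv1d(signal, kernels, bias):
    """Valid 1D cross-correlation over a multi-channel signal.

    signal:  list of channels, each a list of T samples
    kernels: list of filters; each filter holds one list of K weights per channel
    bias:    one float per filter
    Returns one output sequence of length T - K + 1 per filter.
    """
    t = len(signal[0])
    outputs = []
    for filt, b in zip(kernels, bias):
        k = len(filt[0])
        row = []
        for start in range(t - k + 1):
            acc = b
            for channel, weights in zip(signal, filt):
                for j, w in enumerate(weights):
                    acc += w * channel[start + j]
            row.append(acc)
        outputs.append(row)
    return outputs


def relu(maps):
    """Element-wise max(0, x) on every feature map."""
    return [[max(0.0, v) for v in row] for row in maps]


def global_max_pool(maps):
    """Collapse each feature map to its largest activation."""
    return [max(row) for row in maps]


def sensor_features(window, kernels, bias):
    """Convolution, ReLU, and pooling: one feature value per filter."""
    return global_max_pool(relu(conv1d(window, kernels, bias)))


# Hypothetical window: channel 0 is heart rate, channel 1 is acceleration magnitude.
window = [[70.0, 72.0, 75.0, 80.0],
          [0.1, 0.4, 0.9, 1.2]]
# One assumed filter that responds to a rising trend in both channels.
kernels = [[[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]]
features = sensor_features(window, kernels, bias=[0.0])
```

In a trained model the kernel weights would be learned from data; here they are fixed by hand so that the computation can be followed step by step.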
The classification module uses the word vectors of all concepts (sensor features, action names, and object names) to generate action category classifiers. The “seen” category classifiers are obtained by learning from the training data of these categories, and the “unseen” category classifiers are generalized from the “seen” category classifiers by using a graph convolutional network (GCN). The video feature-extraction module extracts the video features of each action and maps the attributes of human actions, object scores, and sensor features into the attribute-feature space. Finally, the classifiers generated by the classification module are used to evaluate the features of each video and calculate the action class scores. Result The experiment is conducted on the Stanford-ECM dataset, which contains sensor and video data. The dataset includes 23 types of human action videos together with heart rate and acceleration data synchronized with the collected video. Our experiment can be divided into three steps. First, we remove the 7 actions that do not meet the experimental conditions and select the remaining 16 actions as the experimental dataset. Then, we select three methods to perform experiments on zero-shot human action recognition. A comparison of the experimental results shows that the accuracy of zero-shot action recognition via two-stream GCNs and knowledge graphs (TS-GCN) is approximately 8% higher than that of zero-shot image classification based on a generative adversarial network (ZSIC-GAN), which demonstrates the auxiliary role of knowledge graphs in action description by using external semantic information and the advantage of GCN. The recognition results of our proposed method are 12% and 4% higher than those of the ZSIC-GAN and TS-GCN methods, respectively, which indicates that, for zero-shot human action recognition, fusing sensor and video features is better than using video features only.
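The classifier-generation idea described in the Method (word vectors of concepts propagated over a graph by a GCN, with the resulting vectors used to score a video feature) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the symmetric normalization with self-loops is the commonly used GCN propagation rule and is assumed here, and the three-node graph, the word vectors, the weight matrix, and the dot-product scoring are all hypothetical.

```python
# Minimal sketch: one GCN layer turns concept word vectors into classifiers,
# which then score a video feature. Graph, vectors, and weights are hypothetical.
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), an assumed (common) GCN normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_norm, h, w, activate=True):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(a_norm, h), w)
    if activate:
        out = [[max(0.0, v) for v in row] for row in out]
    return out


def class_scores(classifiers, feature):
    """Score a feature vector against each classifier by dot product."""
    return [sum(c * f for c, f in zip(row, feature)) for row in classifiers]


# Hypothetical concept graph: node 0 = seen action, node 1 = unseen action,
# node 2 = an object name linked to both actions.
adjacency = [[0.0, 0.0, 1.0],
             [0.0, 0.0, 1.0],
             [1.0, 1.0, 0.0]]
word_vectors = [[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]]
weights = [[1.0, 0.0],
           [0.0, 1.0]]  # would be learned from seen classes in practice

classifiers = gcn_layer(normalize_adjacency(adjacency), word_vectors, weights)
video_feature = [0.2, 0.9]  # hypothetical feature in the same space
scores = class_scores(classifiers[:2], video_feature)  # action nodes only
```

Because the unseen action shares a neighbor with the seen action, information from the seen class reaches the unseen class through the graph, which is the intuition behind generalizing classifiers to categories with no training samples.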
Furthermore, we verify the influence of the number of GCN layers on the recognition accuracy and analyze the reasons for the result. The experimental results show that adding more layers to the three-layer model does not significantly improve its recognition accuracy. One potential reason is that the amount of training data is too small, so overfitting occurs in the deeper network. Conclusion Sensor and video data can comprehensively describe human activity patterns from different views, which facilitates zero-shot human action recognition based on multimodal fusion. Unlike most multimodal fusion methods, which combine image features with text descriptions of the action or with audio data, our study uses the sensor and video features that are most related to the activity state to realize multimodal fusion and pays close attention to the original features of the action. In general, our framework includes three parts: a sensor feature-extraction module, a classification module, and a video feature-extraction module. The framework integrates video features and features extracted from sensor data. The two kinds of features are modeled by using knowledge graphs, and the entire network is optimized by using a classification loss function. The experimental results on the Stanford-ECM dataset demonstrate the effectiveness of the proposed framework. By fully fusing sensor and video features, we significantly improve the accuracy of zero-shot human action recognition.
Keywords