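Human action recognition under depth spatial-temporal energy feature representation
Abstract
Objective Human action recognition from depth map sequences is an important research area in machine vision and artificial intelligence. Existing studies suffer from excessive redundant information in depth map sequences and from missing temporal information in the generated feature maps. To address the redundancy, a key frame algorithm is proposed that improves the computational efficiency of human action recognition algorithms. To address the missing temporal information, a new feature representation for depth map sequences is proposed, the depth spatial-temporal energy map (DSTEM), which highlights the temporal order of human action features.
Method The key frame algorithm removes redundant frames from a depth map sequence according to the redundancy coefficients of the difference image sequence, yielding a key frame sequence sufficient to describe the human action. The DSTEM algorithm builds an energy field from the shape and motion characteristics of the human body to obtain human energy information, and then projects this energy information onto three orthogonal axes to obtain the DSTEM.
Result Experiments on the MSR_Action3D dataset show that the key frame algorithm reduces redundancy and that the computational efficiency of each algorithm improves by 20% to 30% after key frame processing. Histogram of oriented gradient (HOG) features extracted from DSTEM reach a recognition accuracy of 95.54% on the database containing only forward-order actions and maintain 82.14% on the database containing both forward-order and reverse-order actions.
Conclusion The key frame algorithm reduces the redundant information in depth map sequences and speeds up feature map extraction. DSTEM not only retains the spatial information of human actions, which is highlighted by the energy field, but also completely records their temporal information, and it maintains high recognition accuracy on action data that carries temporal information.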
Action recognition under depth spatial-temporal energy feature representation
Chao Xin¹, Hou Zhenjie¹,², Li Xing¹, Liang Jiuzhen¹, Huan Juan¹, Liu Haoyu¹
(1. School of Information Science & Engineering, Changzhou University, Changzhou 213164, China; 2. Jiangsu Province Networking and Mobile Internet Technology Engineering Key Laboratory, Huai'an 223003, China)
Abstract
Objective Action recognition is a research hotspot in machine vision and artificial intelligence, with applications in human-computer interaction, biometrics, health monitoring, video surveillance, somatosensory games, robotics, and other fields. Early studies were performed mainly on color video sequences acquired by RGB cameras. However, color video sequences are sensitive to illumination changes. With the development of imaging technology, and especially the introduction of depth cameras, researchers began to study human action recognition on depth map sequences. Problems remain, however, such as excessive redundant information in depth map sequences and missing temporal information in the generated feature maps. These problems lower both the computational efficiency and the final accuracy of human action recognition algorithms. To address the redundancy, this study proposes a key frame algorithm that removes redundant frames from the depth map sequence. The algorithm improves computational efficiency, and feature maps built from the key frames still represent the human action accurately. To address the missing temporal information, this study presents a new representation, the depth spatial-temporal energy map (DSTEM), which completely preserves the temporal information of the depth map sequence and improves recognition accuracy on databases that contain temporal information.
Method The key frame algorithm first computes the difference image between each pair of adjacent frames in the depth map sequence to produce a differential image sequence. Next, a redundancy coefficient is computed for each frame of the differential image sequence. The frame with the maximum redundancy coefficient is then located in the depth map sequence and deleted. Finally, these steps are repeated several times to obtain a key frame sequence that expresses the human action. In this way, the algorithm removes redundant information by removing redundant frames. The DSTEM algorithm first builds an energy field of the human body, according to the shape and motion characteristics of the body, to obtain the energy information of the human action. Next, the energy information is projected onto three orthogonal Cartesian planes to generate 2D projection maps from three views. Subsequently, two of the 2D projection maps are selected and projected onto the three orthogonal axes to generate 1D energy distribution lists. Finally, the 1D energy distribution lists are concatenated in temporal order to form the DSTEM of the three orthogonal axes. Compared with previous feature map algorithms, DSTEM not only preserves the spatial contour of the human action but also, through the projection of energy information onto three orthogonal axes, completely records its temporal information.
Result The public dataset MSR_Action3D is used to evaluate the proposed methods. The experimental results show that the key frame algorithm removes redundant information from the depth map sequence and that the computational efficiency of each feature map algorithm improves after key frame processing. In particular, the computational efficiency of the DSTEM algorithm improves by nearly 30%, because DSTEM is sensitive to redundant frames in the depth map sequence. Recognition accuracy also improves for every algorithm after key frame processing; for DSTEM, accuracy improves in every test, by nearly 5%. In all tests, DSTEM-HOG (histogram of oriented gradient) achieves or matches the highest recognition accuracy. DSTEM-HOG reaches an accuracy of 95.54% on the database with only forward-order actions, higher than the other algorithms, which indicates that DSTEM completely preserves the spatial information of the depth map sequence. Moreover, DSTEM-HOG maintains an accuracy of 82.14% on the database with both forward-order and reverse-order actions, nearly 40% higher than the other algorithms. Its recognition rate is 34% higher than that of MHI (motion history image)-HOG, which retains part of the temporal information, and 50% higher than that of MEI (motion energy image)-HOG and DMM (depth motion map)-HOG, which do not retain temporal information. These results indicate that DSTEM completely describes the temporal information of the depth map sequence.
Conclusion The experimental results show that the proposed methods are effective. The key frame algorithm reduces the redundant frames in the depth map sequence, improves the computational efficiency of human action recognition algorithms, and clearly improves their recognition accuracy. DSTEM not only retains the spatial information of actions, which is highlighted by the energy field, but also completely records their temporal information. DSTEM achieves the highest recognition accuracy on conventional databases and maintains superior accuracy on databases with temporal information. These results show that DSTEM retains both the spatial and the temporal information of human action and can distinguish forward-order from reverse-order actions.
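The key frame procedure described in the Method paragraph (difference adjacent frames, score each frame, delete the most redundant one, repeat) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the redundancy coefficient, so the sketch assumes that a frame is more redundant the less it differs from its predecessor. The function names and the parameter `n_remove` are hypothetical and not taken from the paper.

```python
def frame_difference(a, b):
    """Sum of absolute pixel differences between two equally sized frames."""
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def redundancy_coefficients(frames):
    """One coefficient per difference image (frame i versus frame i - 1).

    ASSUMPTION: the paper's coefficient is not given in the abstract.
    Here a small difference from the previous frame means high redundancy.
    """
    return [1.0 / (1.0 + frame_difference(frames[i - 1], frames[i]))
            for i in range(1, len(frames))]


def select_key_frames(frames, n_remove):
    """Repeatedly delete the frame with the largest redundancy coefficient."""
    frames = list(frames)  # work on a copy; the input sequence is untouched
    for _ in range(n_remove):
        if len(frames) <= 2:  # nothing meaningful left to compare
            break
        coeffs = redundancy_coefficients(frames)
        # coeffs[k] belongs to frames[k + 1], hence the + 1 offset
        most_redundant = max(range(len(coeffs)), key=coeffs.__getitem__) + 1
        del frames[most_redundant]
    return frames
```

Calling `select_key_frames(depth_frames, n_remove)` on a list of 2D depth frames (nested lists of numbers) returns the shortened sequence. Recomputing the coefficients after every deletion mirrors the "repeat the above steps" wording of the abstract; a real implementation would also need a stopping rule, which the abstract does not specify.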
Keywords
action recognition; depth map sequence; temporal information; depth spatial-temporal energy map (DSTEM); key frame
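The last two DSTEM steps named in the abstract, projecting per-frame energy onto orthogonal axes and concatenating the resulting 1D lists in temporal order, can be illustrated with a minimal sketch. It assumes that a 2D energy map per frame is already available, since the abstract does not define the energy field, and it handles only the two axes of a single projection plane. The function names are hypothetical and this is not the authors' implementation.

```python
def project_to_axes(energy_map):
    """Collapse one 2D energy map (rows = y, columns = x) into two 1D lists."""
    along_x = [sum(col) for col in zip(*energy_map)]  # total energy per column
    along_y = [sum(row) for row in energy_map]        # total energy per row
    return along_x, along_y


def build_axis_maps(energy_maps):
    """Stack the per-frame 1D distributions in temporal order.

    Returns two 2D maps; row t of each map is the distribution of frame t,
    so the vertical direction of the map encodes time.
    ASSUMPTION: the per-frame energy maps are given; the third axis would be
    obtained the same way from a second projection plane.
    """
    x_map, y_map = [], []
    for frame in energy_maps:
        along_x, along_y = project_to_axes(frame)
        x_map.append(along_x)
        y_map.append(along_y)
    return x_map, y_map
```

Because each frame contributes its own row, playing a sequence backwards flips the rows of the resulting map, whereas summing all frames into one image gives the same result in both directions. This is one way to see why a representation of this kind can separate forward-order from reverse-order actions.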