Human action recognition in videos via key semantic region chain extraction

Ma Miao1, Li Yibin2, Wu Xianqing1, Gao Jinfeng1, Pan Haipeng1 (1. Faculty of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. School of Control Science and Engineering, Shandong University, Jinan 250100, China)

Abstract
Objective Human action recognition in videos plays an active role in the intelligentization of fields such as intelligent security, human-robot collaboration, and assistance for the elderly and disabled, and it has broad application prospects. However, existing recognition methods still fall short in effectively utilizing the spatiotemporal features of human action, and recognition accuracy remains to be improved. Therefore, this paper proposes a method that uses a deep learning network to extract the key semantic information of human action in the spatial domain and then concatenates and analyzes this information in the temporal domain, thereby accurately recognizing human actions in videos. Method According to the video image content, repetitive and redundant human action information is eliminated, and the key frames that best express changes in human action are extracted. A deep learning network is designed and constructed to analyze image semantic information and extract the key semantic regions of each image that express important semantic information, effectively describing the spatial information of human action. A Siamese neural network is used to compute the correlation of key semantic regions between video frames; regions with similar semantic information are concatenated into key semantic region chains, the deep learning features of these chains are computed and fused into features that express the human action in the video, and a classifier is trained to realize human action recognition. Result The proposed method is verified on the challenging human action recognition dataset UCF (University of Central Florida) 50, achieving a recognition accuracy of 94.3%, a significant improvement over existing methods. Validation experiments show that the proposed key semantic region computation and interframe key semantic region correlation computation can effectively improve the accuracy of human action recognition. Conclusion Experimental results show that the proposed method can effectively exploit the spatiotemporal information of human action in videos and significantly improve recognition accuracy.
Keywords
Human action recognition in videos utilizing key semantic region extraction and concatenation

Ma Miao1, Li Yibin2, Wu Xianqing1, Gao Jinfeng1, Pan Haipeng1 (1. Faculty of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. School of Control Science and Engineering, Shandong University, Jinan 250100, China)

Abstract
Objective Human action recognition in videos aims to identify action categories by analyzing human action-related information and utilizing spatial and temporal cues. Research on human action recognition is crucial to the development of intelligent security, pedestrian monitoring, and clinical nursing; hence, this topic has become increasingly popular among researchers. The key to improving the accuracy of human action recognition lies in how to construct distinctive features that describe human action categories effectively. Existing human action recognition methods fall into three categories: extracting visual features using deep learning networks, manually constructing image visual descriptors, and combining manual construction with deep learning networks. The methods that use deep learning networks normally operate convolution and pooling on small neighboring regions, thereby ignoring the connections among regions. By contrast, manually constructed methods are often strongly tailored to specific human actions, adapt poorly to others, and have limited application scenarios. Therefore, some researchers combine the idea of handcrafted features with deep learning computation. However, existing methods still have problems in effectively utilizing the spatial and temporal information of human action, and the accuracy of human action recognition still needs to be improved. Considering these problems, we investigate how to design and construct distinguishable human action features and propose a new human action recognition method in which the key semantic information of human action in the spatial domain is extracted using a deep learning network and then concatenated and analyzed in the temporal domain. Method Human action videos usually record more than 24 frames per second; however, human poses do not change at this speed. Changes between consecutive video frames are usually minimal, and most human action information contained in a video is similar or repeated. To avoid redundant computation, we calculate the key frames of a video according to the amplitude of interframe changes in image content. Frames with repetitive content or only slight changes are eliminated to avoid redundant calculation in the subsequent semantic information analysis and extraction. The calculated key frames contain evident changes of the human body and the human-related background and thus reveal sufficient human action information for recognition (a minimal key-frame selection sketch is given below). Then, to analyze and describe the spatial information of human action effectively, we design and construct a deep learning network to analyze the semantic information of images and extract the key semantic regions that express important semantic information. The constructed network, denoted as Net1, is trained by transfer learning and uses successive convolutional layers to mine the semantic information of images. The output of Net1 provides image regions that contain various kinds of foreground semantic information, together with region scores that represent the probability of each region containing foreground information. In addition, a nonmaximum suppression algorithm is used to eliminate regions that overlap excessively (also sketched below). Afterward, the key semantic regions are classified into person and nonperson regions, and the position and proportion of the person regions are used to distinguish the main person from secondary persons.
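The abstract does not give implementation details, so the following Python sketches are illustrative only. The first shows key-frame selection driven by the amplitude of interframe change; the grayscale mean-absolute-difference measure and the threshold tau are assumptions, not the paper's exact criterion.

import numpy as np

def select_key_frames(frames, tau=12.0):
    # frames: list of H x W grayscale images as float arrays.
    # Keep the first frame, then keep a frame only when its content
    # differs from the last kept frame by more than `tau` on average
    # (assumed change measure; the paper's criterion is not specified).
    key_frames = [frames[0]]
    for frame in frames[1:]:
        if np.mean(np.abs(frame - key_frames[-1])) > tau:
            key_frames.append(frame)
    return key_frames

The second sketch is the standard greedy nonmaximum suppression used to discard excessively overlapping region proposals; the IoU threshold of 0.5 is an assumed value.

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) foreground scores.
    # Greedily keep the highest-scoring region and drop regions whose
    # intersection-over-union with it exceeds `iou_thresh`.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]
    return keep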
Moreover, object regions that have no relationship with the main person are eliminated, and only foreground regions that reveal human action-related semantic information are retained. Afterward, a Siamese network is constructed to calculate the correlation of key semantic regions among frames and concatenate the key semantic regions in the temporal domain. The proposed Siamese network, denoted as Net2, has two inputs and one output; Net2 deeply mines and measures the similarity between two input image regions, and its output value expresses this similarity. Net2 concatenates the key semantic regions into semantic region chains, which ensures the temporal consistency of semantic information and expresses how human actions change over time more effectively. Moreover, we crop the feature maps of Net1 and use interpolation and scaling to obtain feature submaps of uniform size, so that each semantic region chain corresponds to a feature matrix chain. Given that the lengths of the feature matrix chains differ, maximum fusion is used to fuse each feature matrix chain into a single fused matrix, which reveals one kind of video semantic information (a sketch of the chaining and fusion steps follows). We stack the fused matrices from all feature matrix chains together and then design and train a classifier, which consists of two fully connected layers and a support vector machine. The output of the classifier is the final human action recognition result for the video. Result The UCF (University of Central Florida) 50 dataset, a publicly available and challenging human action recognition dataset, is used to verify the performance of the proposed method. On this dataset, the average human action recognition accuracy of the proposed method is 94.3%, which is higher than that of state-of-the-art methods, such as those based on optical flow motion expression (76.9%), a two-stream convolutional neural network (88.0%), and SURF (speeded-up robust features) descriptors with Fisher encoding (91.7%). In addition, the proposed crucial algorithms, namely the semantic region chain computation and the key semantic region correlation calculation, are verified through controlled experiments. The results reveal that these two algorithms effectively improve the accuracy of human action recognition. Conclusion The proposed human action recognition method, which uses key semantic region extraction and concatenation, can effectively improve the accuracy of human action recognition in videos.
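As a rough illustration of the temporal steps, the sketch below links region feature submaps across key frames into chains and max-fuses each chain. Cosine similarity stands in for the learned Net2 similarity, and the greedy matching rule and threshold are assumptions; the paper's Net2 is a trained network, not this heuristic.

import numpy as np

def similarity(a, b):
    # Stand-in for the Net2 output: cosine similarity of flattened
    # region feature submaps (the real measure is learned).
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_chains(frame_regions, sim_thresh=0.7):
    # frame_regions: list over key frames; each item is a list of
    # fixed-size feature submaps cropped from the Net1 feature map.
    # Greedily extend each chain with its most similar region in the
    # next frame, mimicking the temporal concatenation step.
    chains = [[r] for r in frame_regions[0]]
    for regions in frame_regions[1:]:
        unmatched = list(range(len(regions)))
        for chain in chains:
            if not unmatched:
                break
            sims = [similarity(chain[-1], regions[j]) for j in unmatched]
            best = int(np.argmax(sims))
            if sims[best] > sim_thresh:  # similar semantics: extend chain
                chain.append(regions[unmatched.pop(best)])
        chains.extend([regions[j]] for j in unmatched)  # leftovers start new chains
    return chains

def fuse_chain(chain):
    # Maximum fusion: element-wise max over one chain's feature matrices.
    return np.max(np.stack(chain), axis=0)

# Usage with synthetic data: 3 key frames, 2 regions each, 7 x 7 submaps.
rng = np.random.default_rng(0)
frame_regions = [[rng.standard_normal((7, 7)) for _ in range(2)] for _ in range(3)]
chains = build_chains(frame_regions)
video_feature = np.stack([fuse_chain(c) for c in chains])  # classifier input

The stacked fused matrices would then feed the classifier, which in the paper consists of two fully connected layers followed by a support vector machine.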
Keywords
