Action recognition for intelligent monitoring

Ma Yuxi, Tan Li, Dong Xu, Yu Chongchong (Beijing Key Laboratory of Big Data Technology for Food Safety, College of Computer & Information Engineering, Beijing Technology & Business University, Beijing 100048, China)

Abstract
Objective To further improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios, a human action recognition algorithm called LC-YOLO (LSTM and CNN based on YOLO), which combines YOLO (you only look once: unified, real-time object detection) with LSTM (long short-term memory) and CNN (convolutional neural network), is proposed. Method Exploiting the real-time nature of YOLO object detection, specific actions in surveillance video are first detected on the fly, and deep features are extracted after the target's size, position, and other information have been obtained; noise data from unrelated regions of the image are then removed; finally, LSTM is used to model the temporal sequence and make the final action judgment on the action sequence in the surveillance video. Result Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6% and the average recognition time is 215 ms, indicating that the proposed method performs well for action recognition in intelligent surveillance. Conclusion An action recognition algorithm is proposed; the experimental results show that it effectively improves the real-time performance and accuracy of action recognition, and that it has good adaptability and broad application prospects in intelligent surveillance with high real-time requirements and complex scenes.
Keywords
Action recognition for intelligent monitoring

Ma Yuxi, Tan Li, Dong Xu, Yu Chongchong(College of Computer & Information Engineering, Beijing Technology & Business University, Beijing 100048, China)

Abstract
Objective The mainstream methods of action recognition still face two main challenges, namely, the extraction of target features and the speed and real-time performance of the overall recognition process. At present, most state-of-the-art methods use a CNN (convolutional neural network) to extract deep features. However, CNNs have high computational complexity, and most regions in a video stream do not contain the target; extracting features from an entire image is therefore expensive. Traditional target detection algorithms, such as the optical flow method, are not real-time, are unstable and susceptible to external conditions such as illumination, camera angle, and distance, and they increase the amount of computation and reduce time efficiency. Therefore, a human action recognition algorithm called LC-YOLO (LSTM and CNN based on YOLO), which is based on YOLO (you only look once: unified, real-time object detection) combined with LSTM (long short-term memory) and CNN, is proposed to improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios. Method The LC-YOLO algorithm consists of three parts, namely, target detection, feature extraction, and action recognition. YOLO target detection is added as an aid to the mainstream CNN+LSTM framework. The fast, real-time nature of YOLO target detection is utilized: specific actions in surveillance video are detected in real time; target size, location, and other information are obtained; features are extracted; and noise data from unrelated regions of the image are efficiently removed. Combined with LSTM modeling of the time series, the final action judgment is made on the sequence of actions in the surveillance video. Overall, the proposed model is an end-to-end deep neural network that takes the raw video action sequence as input and returns the action category.
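The core of the detection step above is using the YOLO-predicted box to discard background pixels before CNN feature extraction. The sketch below illustrates that cropping step only, assuming the YOLO convention of a normalized (x, y, w, h, c) tuple with (x, y) the box center; it is an illustrative reconstruction, not the paper's code.

```python
import numpy as np

def crop_to_target(frame, box):
    """Crop a frame to the YOLO-predicted target region.

    `box` is (x, y, w, h, c): normalized center coordinates, normalized
    width/height, and confidence (unused here), following the YOLO
    output format. Cropping discards background pixels so the CNN
    extracts features from the target only. Illustrative sketch.
    """
    H, W = frame.shape[:2]
    x, y, w, h, _ = box
    # Convert normalized center/size to pixel corner coordinates,
    # clamped to the image bounds.
    x0 = max(int((x - w / 2) * W), 0)
    y0 = max(int((y - h / 2) * H), 0)
    x1 = min(int((x + w / 2) * W), W)
    y1 = min(int((y + h / 2) * H), H)
    return frame[y0:y1, x0:x1]

# Example: a centered box covering 25% of the width, 50% of the height.
frame = np.zeros((120, 160, 3))
crop = crop_to_target(frame, (0.5, 0.5, 0.25, 0.5, 0.9))
# crop.shape == (60, 40, 3): only the target region remains.
```

The cropped region would then be resized to the CNN's input size (224 x 224 for VGGNet-16) before feature extraction.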
The specific process of single-action recognition in the LC-YOLO algorithm can be described as follows. 1) YOLO is used to extract the position and confidence information (x, y, w, h, c); running at 45 frames/s, it can detect specific action frames in surveillance video in real time. When trained on a large number of datasets, the accuracy of YOLO action detection can exceed 90%. 2) On the basis of target detection, the image content within the target range is acquired and retained, and the noise interference from the remaining background is removed, which yields complete and accurate target features. A 4 096-dimensional deep feature vector is extracted with a VGGNet-16 model and passed to the recognition module together with the target size and position information (x, y, w, h, c) predicted by YOLO. 3) In contrast to a standard RNN, the LSTM architecture uses memory cells to store and output information; using the LSTM unit as the recognition module thus captures the temporal relationship among multiple target actions, and the action category of the entire action sequence is output. Compared with previous work, the contributions of the proposed algorithm are as follows. 1) Instead of motion foreground extraction, R-CNN, and other target detection methods, the YOLO algorithm, which is faster and more efficient, is used in this study. 2) The target size and position information are obtained when the target area is locked, and the interference from unrelated regions of the picture can be removed, thereby allowing the CNN to extract deep features effectively. Moreover, the accuracy of feature extraction and the overall time efficiency of action recognition are improved.
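Step 3 above accumulates the per-frame CNN feature vectors into a single action label through LSTM memory cells. The numpy sketch below shows that mechanism with toy dimensions and random, untrained weights (the paper uses 4 096-dimensional VGGNet-16 features); it is a minimal illustration of the recurrence, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class TinyLSTM:
    """Minimal LSTM cell with a linear softmax head (illustrative only).

    Shows how a sequence of per-frame feature vectors is folded into
    one action-category distribution via the memory cell.
    """
    def __init__(self, feat_dim, hidden_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        d = feat_dim + hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: rng.standard_normal((hidden_dim, d)) * 0.1 for g in "ifog"}
        self.b = {g: np.zeros(hidden_dim) for g in "ifog"}
        self.Wy = rng.standard_normal((n_classes, hidden_dim)) * 0.1
        self.hidden_dim = hidden_dim

    def classify(self, features):
        """features: (T, feat_dim) array, one CNN feature vector per frame."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        for x in features:
            z = np.concatenate([x, h])
            i = sig(self.W["i"] @ z + self.b["i"])      # input gate
            f = sig(self.W["f"] @ z + self.b["f"])      # forget gate
            o = sig(self.W["o"] @ z + self.b["o"])      # output gate
            g = np.tanh(self.W["g"] @ z + self.b["g"])  # candidate cell state
            c = f * c + i * g        # memory cell carries sequence history
            h = o * np.tanh(c)
        # Class probabilities for the whole action sequence.
        return softmax(self.Wy @ h)
```

In the real system the classifier would be trained end-to-end; here the weights are random, so only the shapes and the flow of information are meaningful.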
Result Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6%, the average recognition time is 215 ms, and the proposed method performs well for action recognition in intelligent monitoring. Conclusion This study presents a human action recognition algorithm called LC-YOLO, which is based on YOLO combined with LSTM and CNN. The fast, real-time nature of YOLO target detection is utilized: specific actions in surveillance video are detected in real time; target size, location, and other information are obtained; features are extracted; and the noise data of unrelated regions in the image are efficiently removed, which reduces the computational complexity of feature extraction and the time complexity of action recognition. Experimental results on the public action recognition datasets KTH and MSR show that the algorithm has good adaptability and broad application prospects in intelligent monitoring with high real-time requirements and complex scenes.
Keywords
