多特征融合的行为识别模型
谭等泰1,2, 李世超1, 常文文3, 李登楼2(1.甘肃政法大学公安技术学院, 兰州 730070;2.甘肃政法大学司法鉴定中心, 兰州 730070;3.兰州交通大学电子与信息工程学院, 兰州 730070) 摘 要
目的 视频行为识别和理解是智能监控、人机交互和虚拟现实等诸多应用中的一项基础技术,由于视频时空结构的复杂性,以及视频内容的多样性,当前行为识别仍面临如何高效提取视频的时域表示、如何高效提取视频特征并在时间轴上建模的难点问题。针对这些难点,提出了一种多特征融合的行为识别模型。方法 首先,提取视频中高频信息和低频信息,采用本文提出的两帧融合算法和三帧融合算法压缩原始数据,保留原始视频绝大多数信息,增强原始数据集,更好地表达原始行为信息。其次,设计双路特征提取网络,一路将融合数据正向输入网络提取细节特征,另一路将融合数据逆向输入网络提取整体特征,接着将两路特征加权融合,每一路特征提取网络均使用通用视频描述符——3D ConvNets (3D convolutional neural networks)结构。然后,采用BiConvLSTM (bidirectional convolutional long short-term memory network)网络对融合特征进一步提取局部信息并在时间轴上建模,解决视频序列中某些行为间隔相对较长的问题。最后,利用Softmax最大化似然函数分类行为动作。结果 为了验证本文算法的有效性,在公开的行为识别数据集UCF101和HMDB51上,采用5折交叉验证的方式进行整体测试与分析,然后针对每类行为动作进行比较统计。结果表明,本文算法在两个验证集上的平均准确率分别为96.47%和80.03%。结论 通过与目前主流行为识别模型比较,本文提出的多特征模型获得了最高的识别精度,具有通用、紧凑、简单和高效的特点。
关键词
Multi-feature fusion behavior recognition model
Tan Dengtai1,2, Li Shichao1, Chang Wenwen3, Li Denglou2(1.School of Public Security and Technology, Gansu University of Political Science and Law, Lanzhou 730070, China;2.GSIPSL Center of Judicial Expertise, Gansu University of Political Science and Law, Lanzhou 730070, China;3.School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China) Abstract
Objective With the rapid development of internet technology and the increasing popularity of video shooting equipment (e.g., digital cameras and smart phones), online video services have shown an explosive growth. Short videos have become indispensable sources of information for people in their daily production and life. Therefore, identifying how these people understand these videos is critical. Videos contain rich amounts of hidden information as these media can store more information compared with traditional ones, such as images and texts. Videos also show complexity in their space-time structure, content, temporal relevance, and event integrity. Given such complexities, behavior recognition research is presently facing challenges in extracting the time domain representation and features of videos. To address these difficulties, this study proposes a behavior recognition model based on multi-feature fusion. Method The proposed model is mainly composed of three parts, namely, the time domain fusion, two-way feature extraction, and feature modeling modules. The two- and three-frame fusion algorithms are initially adopted to compress the original data by extracting high- and low-frequency information from videos. This approach not only retains most information contained in these videos but also enhances the original dataset to facilitate the expression of original behavior information. Second, based on the design of a two-way feature extraction network, detailed features are extracted from videos through the positive input of the fused data to the network, whereas overall features are extracted through the reserve input of these data. A weighted fusion of these features is then achieved by using the common video descriptor, 3D ConvNets (3D convolutional neural networks) structure. Afterward, BiConvLSTM (bidirectional convolutional long short-term memory network) is used to further extract the local information of the fused features and to establish a model on the time axis to address the relatively long behavior intervals in some video sequences. Softmax is then applied to maximize the likelihood function and to classify the behavioral actions. Result To verify its effectiveness, the proposed algorithm was tested and analyzed on public datasets UCF101 and HMDB51. Results of a five-fold cross-validation show that this algorithm has average accuracies of 96.47% and 80.03%for these datasets, respectively. Comparative statistics for each type of behavior show that the classification accuracy of the proposed algorithm is approximately equal in almost all categories. Conclusion Compared with the available mainstream behavior recognition models, the proposed multi-feature model achieves higher recognition accuracy and is more universal, compact, simple, and efficient. The accuracy of this model is mainly improved via two- and three-frame fusions in the time domain to facilitate video information analysis and behavior information expression. The network is extracted by a two-way feature to efficiently determine the spatio-temporal features of videos. The BiConvLSTM network is then applied to further extract the features and establish a timing relationship.
Keywords
behavior recognition two-way feature extraction network 3D convolutional neural networks (3D ConvNets) bidirectional convolutional long short-term memory network (BiConvLSTM) weighted fusion high-frequency feature low-frequency feature
|