Gait human-body semantic segmentation with a multilayer semantic fusion CNN
Abstract
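Objective To address the contour loss, human shadows, and long computation time caused by multiple covariates such as illumination, camera angle, and occlusion when gait recognition is performed on surveillance video in anti-terrorism and security applications, a gait human semantic segmentation method based on the RPGNet (Region of Interest + Parts of Body Semantics + GaitNet) network is proposed. Method The method is divided by function into the R (region of interest) module, the P (parts of body semantics) module, and the GNet (GaitNet) module. The R module extracts the region of interest of the gait body, which improves algorithm efficiency and denoises the image. The P module uses the open-source image annotation tool LabelMe to annotate the semantics of gait body parts. The GNet module trains on and segments the semantics of gait body parts. Drawing on the ResNet and RefineNet network models, a detailed gait semantic segmentation network model is designed. Result Tests were conducted on 1 380 images from a gait database, and the RPGNet method was compared experimentally with six human contour segmentation methods. The results show that RPGNet handles both detail and global information precisely and achieves high segmentation accuracy at viewing angles of 0°, 45°, and 90°. Under multi-person, hat-wearing, and occlusion conditions, the results show that RPGNet segments the human body well and can meet the real-time requirements of gait recognition. Conclusion The experimental results show that the RPGNet gait human semantic segmentation method can effectively perform gait human semantic segmentation under multiple covariates and also effectively improves the recognition rate of gait recognition.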
Keywords
Semantic segmentation of gait body with multilayer semantic fusion convolutional neural network
Zhi Shuangshuang1, Zhao Qinghui2, Tang Jin2
(1. Engineering Training Center, Xi'an Polytechnic University, Xi'an 710048, China; 2. School of Information Science and Engineering, Central South University, Changsha 410083, China)
Abstract
Objective Gait recognition has several advantages over DNA, fingerprint, iris, and 2D and 3D face recognition. For example, it does not require the cooperation of the subject, it can be performed at a relatively long distance and with relatively low image quality, and a person's gait is difficult to disguise or hide. Gait recognition has therefore become a research hotspot in recent years and is widely used in security, anti-terrorism, and medical applications, such as personal identification and the treatment and rehabilitation of abnormal leg and foot conditions. This paper proposes a novel gait human semantic segmentation method based on the RPGNet (Region of Interest + Parts of Body Semantics + GaitNet) network to solve the problems of contour loss, human shadow, and long computing time caused by lighting, camera angle, and occlusion when gait recognition is performed on surveillance video in anti-terrorism and security applications. Method The method is divided by function into the R (region of interest), P (parts of body semantics), and GNet (GaitNet) modules. The R module extracts the region of interest of the gait body, which improves computing efficiency and reduces image noise. First, the original image is processed by background subtraction and converted into a binary image. Second, morphological operations such as dilation, erosion, and filtering are applied to the binary image. Third, the connected region of the human body is located in the image and enclosed in a rectangular frame. Finally, the length and width of the rectangular frame are enlarged by a quarter and the image is cropped to this frame, which yields the region of interest. The main function of the P module is to annotate gait body parts semantically by using LabelMe, an open-source image annotation tool. We train on the human body part by part. 
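The R-module steps described above (background subtraction, binarization, locating the connected body region, enlarging the bounding rectangle by a quarter, and cropping) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the threshold of 30, and the use of nested lists for grayscale images are assumptions made for illustration. The morphological cleanup is omitted, and small noise blobs are instead discarded by keeping only the largest connected region. "Enlarged by a quarter" is interpreted here as growing the total height and width by 25%, split evenly between the two sides, which is also an assumption.

```python
from collections import deque
from math import ceil


def foreground_mask(frame, background, threshold=30):
    """Background subtraction followed by binarization (1 = foreground)."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def largest_component_box(mask):
    """Inclusive (top, left, bottom, right) box of the largest 4-connected region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best_size, best_box = 0, None
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected region.
            seen[r][c] = True
            queue = deque([(r, c)])
            size, top, left, bottom, right = 0, r, c, r, c
            while queue:
                y, x = queue.popleft()
                size += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_box = size, (top, left, bottom, right)
    return best_box


def enlarge_box(box, height, width, factor=0.25):
    """Grow the box by `factor` of its size (assumed split over both sides), clipped to the image."""
    top, left, bottom, right = box
    pad_y = ceil((bottom - top + 1) * factor / 2)
    pad_x = ceil((right - left + 1) * factor / 2)
    return (max(0, top - pad_y), max(0, left - pad_x),
            min(height - 1, bottom + pad_y), min(width - 1, right + pad_x))


def crop_roi(frame, background, threshold=30):
    """Return the cropped region of interest, or None if no foreground is found."""
    mask = foreground_mask(frame, background, threshold)
    box = largest_component_box(mask)
    if box is None:
        return None
    top, left, bottom, right = enlarge_box(box, len(frame), len(frame[0]))
    return [row[left:right + 1] for row in frame[top:bottom + 1]]
```

Calling `crop_roi(frame, background)` on a grayscale frame and a background image of the same size returns the enlarged crop around the largest moving region. A production pipeline would more likely rely on an image-processing library for the subtraction, morphology, and connected-component steps.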
The semantics of the human body are defined as six parts: head, trunk, upper arm, lower arm, thigh, and lower leg. The six body parts are mapped one-to-one to six RGB values. Then, we use LabelMe to annotate the semantics of the images captured by the camera, which generates a structure file of the semantic annotation in XML format. Finally, the XML file and the original RGB image are imported into MATLAB to generate a semantic annotation map of the human body parts. The GNet module is a detailed semantic segmentation network model for the gait body, designed with reference to the existing ResNet and RefineNet network models. Following ResNet, it extracts the high-level and low-level semantics of the gait human body; following RefineNet, it integrates the low-level semantics with the high-level semantics. Multi-resolution images pass through residual convolution units to generate fine low-level semantic feature maps and coarse high-level semantic feature maps. Then, the feature maps are fed into the multi-resolution fusion unit to generate fused feature maps. Afterwards, chained residual pooling is applied to the fused feature maps to generate fused, pooled feature maps. These pooled feature maps are then processed by an output convolution to obtain the semantically segmented feature maps. Finally, a softmax classifier, together with bilinear interpolation, produces the final gait semantic segmentation image. Through many experiments, we find that the semantic segmentation of the gait human body is better when the resolutions are 1/8, 1/16, 1/32, and 1/64 of the original image than with other resolution settings. 
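The one-to-one mapping between the six body parts and RGB values described for the P module can be sketched as follows. The specific colors, the class indices, and the function names are hypothetical, since the abstract does not list the actual values; only the six part names come from the text. The sketch converts a color annotation map into the class-index map that a segmentation network is typically trained on, and back again.

```python
# Hypothetical palette: the abstract does not give the actual RGB values.
PART_COLORS = {
    "head": (255, 0, 0),
    "trunk": (0, 255, 0),
    "upper arm": (0, 0, 255),
    "lower arm": (255, 255, 0),
    "thigh": (255, 0, 255),
    "lower leg": (0, 255, 255),
}

BACKGROUND_ID = 0  # assumed label for every pixel that is not a body part

# Class indices 1..6 in the order the parts are listed above (an assumption).
PART_IDS = {name: index + 1 for index, name in enumerate(PART_COLORS)}
COLOR_TO_ID = {color: PART_IDS[name] for name, color in PART_COLORS.items()}
ID_TO_COLOR = {part_id: color for color, part_id in COLOR_TO_ID.items()}


def annotation_to_labels(annotation):
    """Convert an RGB annotation map (rows of (r, g, b) tuples) to class indices."""
    return [[COLOR_TO_ID.get(tuple(pixel), BACKGROUND_ID) for pixel in row]
            for row in annotation]


def labels_to_annotation(labels, background_color=(0, 0, 0)):
    """Convert a class-index map back to an RGB annotation map."""
    return [[ID_TO_COLOR.get(label, background_color) for label in row]
            for row in labels]
```

Because the mapping is one-to-one, converting an annotation map to labels and back reproduces the original colors for all six parts, and any unlisted color falls back to the background label.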
Result A test on 1 380 images from the gait database shows that the proposed RPGNet method achieves higher segmentation accuracy in both local and global information processing than six human contour segmentation methods, especially at viewing angles of 0°, 45°, and 90°. In this study, we define a formula for the segmentation accuracy ρ, and experiments show that the accuracy of human gait segmentation is positively correlated with the gait recognition rate. Measured by ρ, the RPGNet semantic segmentation algorithm shows high segmentation accuracy at viewing angles of 0°, 45°, and 90° alike. Experiments on human segmentation under multi-person, hat-wearing, and occlusion conditions show that the RPGNet-based segmentation algorithm handles both global and local segmentation well, with high segmentation precision and high contour integrity. The RPGNet algorithm can process eight frames per second, which meets the real-time requirements of gait recognition. Conclusion The proposed gait semantic segmentation method not only solves the problems of missing contours and human shadow caused by multiple covariates in outdoor conditions but also handles contours that are difficult to segment in outdoor multi-person, hat-wearing, and occlusion conditions. The RPGNet-based human semantic segmentation method can improve the recognition rate of a gait recognition system, as indicated by an experiment on the relationship between recognition rate and segmentation accuracy. Simulations and analyses show that the proposed RPGNet method achieves improved human segmentation and a high gait recognition rate in multi-person scenes, with people wearing hats, and under occlusion. The image semantic segmentation algorithm is built on a deep learning model and uses a GPU to accelerate training. 
Its training cost is higher than that of traditional machine learning methods, and its segmentation process is slower than that of traditional methods such as background subtraction. Future work will reduce the depth and complexity of the network model to speed up training and testing.
Keywords
gait recognition; semantic segmentation; convolutional neural network (CNN); multi-covariate; human contour segmentation