Human pose estimation under a cross-stage structure

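Yang Xingming, Zhou Yahui, Zhang Shunran, Wu Kewei, Sun Yongxuan (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601)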

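Abstract
Objective Image-based human pose estimation is a very important research topic in computer vision and is widely used in human-computer interaction, surveillance, and image retrieval. However, because of the diversity of human visual appearance, occlusion, and cluttered backgrounds, human pose estimation has long been a difficult and active problem in computer vision. This paper focuses on the role of initial features in joint localization and proposes a cross-stage convolutional pose machine (CSCPM). Method First, a VGG (visual geometry group) network is used to obtain preliminary initial features of the image. These initial features are the basis of joint localization, but they are also hard to learn because of interference from self-occlusion and cluttered backgrounds. Second, on the basis of the initial features, a multi-layer model is built to learn structural features at different scales; to address the vanishing gradient problem in deep learning, the initial features are concatenated into the features of every subsequent layer. Finally, a joint loss for multi-scale joint localization is designed to learn the parameters of the deep network. Result Experiments on two major human pose datasets, MPII (MPII human pose dataset) and LSP (Leeds Sports Pose), compare the method qualitatively and quantitatively with human pose estimation methods from the past three years. On the MPII dataset, the total detection rate of the model is 89.1%, which is 0.7 percentage points higher than that of the second-best model; on the LSP dataset, the total detection rate is 91.0%, which is 0.5 percentage points higher than that of the second-best model. Conclusion The experimental results show that initial feature learning can effectively identify self-occlusion and cluttered-background interference at the joints, and that the CSCPM pose estimation model with its cross-stage structure outperforms existing human pose estimation models.
Keywords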
Human pose estimation based on cross-stage structure

Yang Xingming, Zhou Yahui, Zhang Shunran, Wu Kewei, Sun Yongxuan(School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)

Abstract
Objective The rapid development of modern network and computer technology has moved society steadily toward an information-driven, intelligent era. Human pose estimation derives high-level semantic interpretations from an input image or image sequence by processing, analyzing, and understanding it with a computer. It has broad applications and development prospects in human-computer interaction, surveillance, image retrieval, motion analysis, virtual reality, perception interfaces, and related areas. Image-based human pose estimation is therefore an extremely important research topic in computer vision. However, it has long been a difficult and active problem because of the diversity of human visual appearance, occlusion, and complex backgrounds. In this paper, we consider the problem of human pose estimation from a single still image. Traditional 2D human pose estimation algorithms are based on pictorial structures (PS) models. These algorithms face the following difficulty: they need to detect human parts in images, but in the real world, detecting a single part of the human body is very difficult because of background noise and the wide variety of human appearance. In recent years, the development of deep learning has led to new methods for human pose estimation. Compared with traditional algorithms, deep models have deeper hierarchies and can learn more complex patterns. In this work, we mainly focus on the effect of initial features on human joint point positioning and propose cross-stage convolutional pose machines (CSCPM). Method First, the VGG network is used to obtain the preliminary initial features of the image, which are the basis of joint point positioning. The VGG network inherits the frameworks of LeNet and AlexNet and adopts a 19-layer deep network.
The VGG network is a preferred choice for extracting convolutional neural network (CNN) features from images. The initial features retain more of the original information because the VGG network processes the image directly. Learning the parameters of the deep convolutional network is difficult because of interference from self-occlusion and mixed backgrounds. Second, on the basis of the initial features, a multistage model is constructed to learn the structural features at different scales. The multistage model consists of a sequence of convolutional networks that repeatedly produce 2D belief maps for the location of each part. The initial features are concatenated into the features of each subsequent stage to solve the vanishing gradient problem in initial feature learning. The network is divided into six stages. The first and second stages use the original image as input, and the third to sixth stages use the feature maps produced by the second stage as input. Finally, a joint loss function for multi-scale joint localization is designed to learn the parameters of the deep convolutional network. Each stage of the CSCPM is supervised, which effectively enforces intermediate supervision throughout the network. Intermediate supervision has the advantage that, even though the full architecture can have many layers, it does not fall prey to the vanishing gradient problem, because the intermediate loss functions replenish the gradients at each stage. We encourage the network to repeatedly arrive at such a representation by defining, at the output of each stage, a loss function that minimizes the Euclidean distance between the predicted and ideal belief maps for each part. Result We evaluate the proposed method on two widely used benchmarks, namely the MPII (MPII human pose dataset) and extended LSP (Leeds Sports Pose) datasets, and compare it qualitatively and quantitatively with other human pose estimation methods from the past three years.
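The cross-stage concatenation and the per-stage loss described in the Method part can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Gaussian form of the ideal belief map, the value of `sigma`, and the use of the squared Euclidean distance are all assumptions made for illustration, and the arrays stand in for real network outputs.

```python
import numpy as np

def gaussian_belief_map(height, width, center_xy, sigma=1.5):
    """Ideal belief map for one part: a 2D Gaussian peak at the annotated joint.

    The Gaussian shape and sigma are assumptions, not taken from the abstract.
    """
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def concat_initial_features(initial_features, stage_features):
    """Cross-stage step: concatenate initial features with a stage's features.

    Both inputs are (channels, H, W); concatenation is along the channel axis.
    """
    return np.concatenate([initial_features, stage_features], axis=0)

def intermediate_supervision_loss(stage_predictions, ideal_maps):
    """Sum over stages of the squared Euclidean distance to the ideal maps.

    stage_predictions: list of arrays, one per stage, each (num_parts, H, W)
    ideal_maps: array of shape (num_parts, H, W)
    Returns the total loss and the list of per-stage losses.
    """
    stage_losses = [float(np.sum((pred - ideal_maps) ** 2))
                    for pred in stage_predictions]
    return sum(stage_losses), stage_losses

# Toy example: two parts on an 8x8 grid, two stages of predictions.
ideal = np.stack([gaussian_belief_map(8, 8, (2, 3)),
                  gaussian_belief_map(8, 8, (5, 5))])
predictions = [ideal.copy(), np.zeros_like(ideal)]  # perfect stage, empty stage
total_loss, per_stage = intermediate_supervision_loss(predictions, ideal)
features = concat_initial_features(np.zeros((4, 8, 8)), np.ones((3, 8, 8)))
```

Because every stage contributes its own term to the total, each stage receives a gradient signal directly from its own output, which is the property the abstract credits for avoiding vanishing gradients.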
In the experiments, the percentage of correct keypoints (PCK) measure is used to evaluate the performance of human pose estimation methods: a keypoint location is considered correct if its distance to the ground-truth location is no more than a threshold defined as a fraction of the length of a body part. The official benchmark on the MPII dataset adopts PCKh (which uses a fraction of the head length as reference) at 0.5, while the official benchmark on the LSP dataset adopts PCK at 0.2. On the MPII dataset, the total detection rate of the model is 89.1%, which is 0.7 percentage points higher than that of the second-best model. On the LSP dataset, the total detection rate of the model is 91.0%, which is 0.5 percentage points higher than that of the second-best model. The qualitative results show the benefits of the cross-stage structure. The detection results are improved in some scenes, such as those with occlusion and complex backgrounds, because the concatenated initial features retain the original information. Conclusion The human pose estimation model CSCPM is designed to address the failure cases of convolutional pose machines (CPM) in some complex scenes, such as self-occlusion, mixed backgrounds, and joints of nearby people. The model provides a sequential prediction framework for human pose estimation that introduces a cross-stage structure on top of the CPM model. The experimental results show that the proposed model improves the accuracy of human pose estimation and locates the joint points more accurately. The effectiveness of the proposed initial feature learning and the benefit of the cross-stage structure are evaluated on two widely used human pose estimation benchmarks. Our approach achieves state-of-the-art performance on both datasets. The initial feature learning can effectively identify self-occlusion and mixed-background interference at the joints.
The CSCPM, a human pose estimation model with a cross-stage structure, outperforms existing human pose estimation models.
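The PCK measure used in the Result part can be written down in a few lines. The sketch below is an illustration under stated assumptions rather than the official benchmark code: the function name `pck`, the array layout, and the toy coordinates are hypothetical, every joint is treated as annotated and visible, and the choice of reference length (head length for PCKh, another body-part length for PCK) is left to the caller.

```python
import numpy as np

def pck(pred, gt, ref_lengths, alpha):
    """Fraction of keypoints whose error is within alpha * reference length.

    pred, gt: arrays of shape (num_images, num_joints, 2) with (x, y) positions
    ref_lengths: array of shape (num_images,), one reference length per image
    alpha: threshold fraction, e.g. 0.5 for PCKh at 0.5 or 0.2 for PCK at 0.2
    """
    distances = np.linalg.norm(pred - gt, axis=-1)   # (num_images, num_joints)
    thresholds = alpha * ref_lengths[:, None]        # (num_images, 1)
    return float(np.mean(distances <= thresholds))

# Toy example: one image, two joints, reference (head) length of 8 pixels.
pred = np.array([[[10.0, 10.0], [20.0, 20.0]]])
gt = np.array([[[10.0, 13.0], [30.0, 20.0]]])
head_lengths = np.array([8.0])
score = pck(pred, gt, head_lengths, alpha=0.5)
```

In this toy case the threshold is 4 pixels; the first joint is off by 3 pixels and counts as correct, while the second is off by 10 pixels and does not, so `score` is 0.5.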
Keywords
