3D human pose estimation in video combining sparse representation and deep learning
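Abstract
Objective Error in 2D pose estimation is the main source of error in 3D human pose estimation. The key to improving 3D human pose estimation is therefore to map a 2D pose corrupted by error or noise to the optimal and most plausible 3D pose. This paper proposes a 3D pose estimation method that couples sparse representation with a deep model, combining the spatial geometric prior of 3D poses with temporal information to improve 3D pose estimation accuracy. Method A 3D deformable shape model incorporating sparse representation is used to obtain a reliable initial 3D estimate for each single frame. A multi-channel long short term memory (MLSTM) denoising encoder-decoder is then constructed, and the single-frame initial 3D estimates are fed into it as a time series. The MLSTM denoising encoder-decoder learns the temporal dependence of the human pose between adjacent frames and imposes a temporal smoothness constraint, yielding the final optimized 3D pose. Result Comparative experiments were conducted on the Human3.6M dataset. For two kinds of input, the 2D coordinates provided by the dataset and the 2D coordinates estimated by a convolutional neural network, the average reconstruction error over video sequences after optimization by the MLSTM denoising encoder-decoder decreased by 12.6% and 13%, respectively, relative to single-frame estimation. Relative to an existing video-based sparse model method, the average reconstruction error decreased by 6.4% and 9.1%, respectively. For the estimated 2D coordinates, relative to an existing deep model method, the average reconstruction error decreased by 12.8%. Conclusion Combining the proposed MLSTM denoising encoder-decoder, which is based on temporal information, with the sparse model makes effective use of 3D pose prior knowledge and of the temporal and spatial dependence of continuously changing human poses across video frames, improving the accuracy of monocular video 3D pose estimation to some extent.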
Keywords
Video based 3D human pose estimation combining sparse representation and deep learning
Wang Weinan, Zhang Rong, Guo Lijun (Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China)
Abstract
Objective 3D human pose estimation from monocular videos has long been an open research problem in the computer vision and graphics communities. Understanding human posture and limb articulation is important for high-level computer vision tasks such as human-computer interaction, augmented and virtual reality, and human action or activity recognition. The recent success of deep networks has led many state-of-the-art 3D pose estimation methods to train deep networks end to end and predict poses directly from images. The top-performing approaches have shown the effectiveness of dividing 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate 2D poses from images and then mapping them into 3D space. Results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from 2D pose estimation error. Therefore, mapping a 2D pose that contains error or noise to its optimal and most plausible 3D pose is crucial. We propose a 3D pose estimation method that jointly uses sparse representation and a deep model. This method combines the spatial geometric prior of 3D poses with temporal information to improve 3D pose estimation accuracy. Method First, we use a 3D deformable shape model that integrates sparse representation (SR) to represent rich variations in 3D human pose. A convex relaxation method based on L1/2 regularization transforms the nonconvex optimization problem of the shape-space model for a single frame into a convex programming problem and provides a reasonable initial value for each frame. In this manner, the possibility of ambiguous reconstructions is considerably reduced. Second, the initial 3D poses obtained from the SR module, regarded as noisy 3D data, are fed into a multi-channel long short term memory (MLSTM) denoising encoder-decoder as pose sequences along the temporal dimension.
The noisy 3D data are split into X, Y, and Z components to preserve the spatial structure of the 3D pose. For each component, multilayer LSTM cells capture the temporal variation across frames. The output of the LSTM unit is not the optimized result for the corresponding component; rather, it is the temporal dependence between adjacent frames of the human pose in the input sequence, implicitly encoded by the hidden layer of the LSTM unit. The learned temporal information is added to the initial value through a residual connection to maintain the temporal consistency of the 3D pose and effectively alleviate sequence jitter. Moreover, occluded joints can be corrected by the smoothness constraint between adjacent frames. Lastly, we obtain the optimized 3D pose estimates by decoding with the last linear layer. Result A comparative experiment on the Human3.6M dataset is conducted to verify the validity of the proposed method, and the results are compared with those of state-of-the-art methods. Following common practice, the quantitative evaluation first aligns the predicted 3D pose with the ground truth 3D pose using a similarity transformation. We then report the average per-joint error, in millimeters, between the estimated and ground truth 3D poses. The 2D joint ground truth and the 2D poses estimated by a convolutional network are used separately as inputs. The quantitative results suggest that the proposed method improves 3D estimation accuracy. When the input is the 2D joint ground truth provided by the Human3.6M dataset, the average reconstruction error decreases by 12.6% after optimization by our model compared with single-frame estimation. Compared with the existing video-based sparse model method, the average reconstruction error decreases by 6.4% with our method.
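The evaluation protocol described above (similarity alignment followed by the mean per-joint error) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' evaluation code: the function names `procrustes_align`, `mpjpe`, and `pa_mpjpe`, and the `(J, 3)` array layout with one row per joint, are assumptions introduced here.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the least-squares similarity
    transform (scale, rotation, translation). Assumes pred is not degenerate
    (its joints do not all coincide)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g              # centered joint coordinates
    U, S, Vt = np.linalg.svd(p.T @ g)          # SVD of the 3x3 cross-covariance
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])                 # forbid reflections
    R = U @ D @ Vt                             # rotation applied to row vectors
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint, in the units of the inputs."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return np.linalg.norm(diff, axis=1).mean()

def pa_mpjpe(pred, gt):
    """Per-joint error after similarity alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

With joint coordinates given in millimeters, `pa_mpjpe` returns an error in millimeters. Averaging it over all frames of a sequence gives one plausible reading of the "average reconstruction error" reported here.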
When the input is the 2D poses estimated by a convolutional network, the average reconstruction error decreases by 13% after optimization by our model compared with single-frame estimation. Compared with the existing deep model method, the average reconstruction error decreases by 12.8% with our method. Compared with the existing video-based sparse model method, the average reconstruction error decreases by 9.1% with our method. Conclusion By combining our temporal-information-based MLSTM encoder-decoder with the sparse model, we exploit 3D pose prior knowledge together with the temporal and spatial dependence of continuous human pose changes, and improve the accuracy of 3D pose estimation from monocular video.
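The per-axis refinement described in the Method part (splitting the sequence into X, Y, and Z channels, running an LSTM over time, decoding with a linear layer, and adding the result back to the initial estimate) can be sketched as a forward pass in NumPy. This is an illustrative sketch under several assumptions, not the authors' network. It uses one LSTM layer per channel instead of multiple layers, random untrained weights, no temporal smoothness loss or training loop, and it applies the linear decoder before the residual addition. All function names and the `(T, J, 3)` layout are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_channel(num_joints, hidden_dim, rng):
    """Random (untrained) weights for one axis channel: LSTM + linear decoder."""
    return {
        "W": rng.normal(0.0, 0.1, (num_joints + hidden_dim, 4 * hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
        "W_out": rng.normal(0.0, 0.1, (hidden_dim, num_joints)),
        "b_out": np.zeros(num_joints),
    }

def lstm_forward(ch, seq):
    """Run a single-layer LSTM over seq (T, J); return hidden states (T, H)."""
    hidden_dim = ch["b"].size // 4
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    states = []
    for x_t in seq:
        z = np.concatenate([x_t, h]) @ ch["W"] + ch["b"]
        i, f, o, g = np.split(z, 4)            # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def refine_sequence(initial_poses, channels):
    """initial_poses: (T, J, 3) single-frame estimates; channels: three dicts,
    one per axis. Returns refined poses with the same shape."""
    initial_poses = np.asarray(initial_poses, dtype=float)
    refined = np.empty_like(initial_poses)
    for axis, ch in enumerate(channels):
        comp = initial_poses[:, :, axis]       # (T, J) coordinates on one axis
        hidden = lstm_forward(ch, comp)        # temporal encoding
        correction = hidden @ ch["W_out"] + ch["b_out"]   # linear decoder
        refined[:, :, axis] = comp + correction           # residual connection
    return refined

rng = np.random.default_rng(0)
poses = rng.normal(size=(5, 17, 3))            # 5 frames, 17 joints (made-up data)
channels = [init_channel(17, 8, rng) for _ in range(3)]
refined = refine_sequence(poses, channels)
```

Because of the residual connection, each channel only has to model a correction to the single-frame estimate. If the decoder outputs zero, the refined sequence equals the initial one.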
Keywords
pose estimation; 3D human pose; sparse representation; long short term memory (LSTM); residual connection