6D pose estimation combining mask localization and an hourglass network

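Li Dongdong1, Zheng Herong1, Liu Fuchang2, Pan Xiang1 (1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023; 2. School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121)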

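Abstract
Objective 6D pose estimation is an important problem in 3D object recognition and reconstruction. Many objects have smooth, textureless surfaces from which features are hard to extract, which makes detection difficult. Many algorithms rely on post-processing to improve pose accuracy, which slows them down. To address these problems, this paper proposes a heat-map-based 6D object pose estimation algorithm. Method First, segmentation masks are used so that heat maps contaminated by occlusion do not degrade key point prediction accuracy. Second, the method builds on the hourglass network architecture and needs no post-processing, which keeps it efficient. In the object detection stage, a segmentation network with the fast YOLOv3 (you only look once v3) as its backbone predicts the mask of the target object, reducing the influence of unrelated occluding objects. To improve mask accuracy, deconvolution layers are added to raise the resolution of the feature layers, which are then fused. Next, an hourglass network predicts the key points, avoiding the drop in key point detection accuracy that residual network modules suffer when local features are lost. Finally, the pose is computed from the detected key points: the PnP (perspective-n-point) algorithm recovers the 6D pose of the object. Result Experiments were carried out on the challenging Linemod dataset. The proposed algorithm achieves a 3D error accuracy of 82.7%, 10% higher than the heat map method, and a 2D projection accuracy of 98.9%, 4% higher than mainstream algorithms, while running at 15 frames/s. Conclusion The proposed mask-based key point detection algorithm effectively improves the accuracy of 6D pose estimation while maintaining an efficient detection speed.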
6D pose estimation based on mask location and hourglass network

Li Dongdong1, Zheng Herong1, Liu Fuchang2, Pan Xiang1(1.College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China;2.School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China)

Abstract
Objective 6D pose estimation is a core problem in 3D object detection and reconstruction. Traditional pose estimation methods usually cannot handle textureless objects. Many methods rely on post-processing to compensate, but this slows pose estimation down. To obtain a fast, single-shot solution, this paper proposes a 6D object pose estimation algorithm based on mask location and heat maps. At prediction time, masks are first used to locate objects, which reduces the error caused by occlusion. To accelerate mask generation, the you only look once v3 (YOLOv3) network is used as the backbone. The algorithm requires no post-processing: the neural network directly predicts the locations of the key points at high speed. Method The algorithm consists of the following main steps. First, a segmentation network generates masks during object detection, with YOLOv3 as the backbone to speed up this step. The segmentation network adds a branch to the original detection network and uses deconvolution to extract features at different resolutions. Convolution layers with kernel sizes of 1×1, 3×3, and 1×1 are added to each deconvolution. These features are then fused and used to generate the object target and the mask map, with mean squared error as the regression loss. Second, an hourglass network predicts the key points of each object. The hourglass network has an encoder-decoder form. In the encoding stage, down-sampling reduces the scale and residual modules extract features. In the decoding stage, up-sampling restores the scale. The features at each scale pass through a residual module, which extracts features without changing the data size.
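The role of the mask in key point prediction can be illustrated with a minimal sketch. It assumes the mask is applied by ignoring heat map responses outside the object region; the abstract does not specify how the two are combined, so the function name `masked_peak` and this combination rule are hypothetical.

```python
def masked_peak(heatmap, mask):
    """Return (row, col, score) of the strongest heat map response inside the mask.

    heatmap: 2D list of floats for one key point channel.
    mask:    2D list of 0/1 with the same shape (1 = pixel belongs to the object).
    Returns None when the mask is empty.
    Assumption: responses outside the mask are simply ignored.
    """
    best = None
    for r, (h_row, m_row) in enumerate(zip(heatmap, mask)):
        for c, (score, inside) in enumerate(zip(h_row, m_row)):
            if inside and (best is None or score > best[2]):
                best = (r, c, score)
    return best


# Toy data: the 0.9 response lies on an occluding object, outside the mask.
heatmap = [
    [0.1, 0.2, 0.9],
    [0.3, 0.7, 0.2],
    [0.0, 0.4, 0.1],
]
mask = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
]
print(masked_peak(heatmap, mask))  # (1, 1, 0.7)
```

Without the mask, the global maximum at (0, 2) would be picked, which is the kind of occlusion-induced error the mask is meant to suppress.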
To prevent the feature map from losing local information when the scale is enlarged, a multiscale feature constraint is proposed. Before each down-sampling step, the network splits into two branches to retain the information at the original scale, and the skip branch contains a single convolution with a kernel size of 1. After an up-sampling step, the features at the same scale are concatenated. Features at four resolutions are concatenated in this way during up-sampling, and the initial feature map is combined with the up-sampled feature map. The hourglass network is not up-sampled directly to the input resolution to regress the heat map. Instead, it serves as relay supervision that constrains the final heat map produced by the residual network. Finally, the 6D pose of the object is recovered with the perspective-n-point (PnP) algorithm. Result The algorithm is evaluated on the challenging Linemod dataset, which contains 15 models and is difficult because of the complexity of its scenes. The proposed method is compared with state-of-the-art methods in terms of the 3D average distance (ADD) error and the 2D projection error. The ADD accuracy reaches 82.7%, which is 10% higher than that of Betapose, an existing heat map method, and the 2D projection accuracy reaches 98.9%, a 4% improvement over mainstream algorithms. For symmetrical objects, Betapose selects feature points with the object's symmetry in mind to improve pose accuracy, whereas our algorithm extracts feature points with SIFT and uses no symmetry knowledge. Even so, our results on symmetrical objects are higher than those of Betapose. The 10% gain in ADD accuracy comes with a slight drop in speed, from 17 to 15 frames/s.
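The two evaluation measures named above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' evaluation code: the acceptance thresholds (ADD below 10% of the model diameter, mean reprojection error below 5 pixels) are the customary Linemod conventions and are not given in the abstract, and all function names, the intrinsics `K`, and the toy poses are hypothetical.

```python
import math


def transform(R, t, p):
    """Apply rotation R (3x3 nested list) and translation t (length 3) to point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean 3D distance between model points under the true and predicted poses."""
    total = sum(
        math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
        for p in points
    )
    return total / len(points)


def project(K, R, t, p):
    """Pinhole projection of model point p to pixel coordinates (no distortion)."""
    x, y, z = transform(R, t, p)
    return [K[0][0] * x / z + K[0][2], K[1][1] * y / z + K[1][2]]


def projection_error(points, K, R_gt, t_gt, R_pred, t_pred):
    """Mean pixel distance between the projections under the two poses."""
    total = sum(
        math.dist(project(K, R_gt, t_gt, p), project(K, R_pred, t_pred, p))
        for p in points
    )
    return total / len(points)


def add_correct(add, diameter, ratio=0.1):
    """Assumed convention: a pose counts as correct if ADD < 10% of the diameter."""
    return add < ratio * diameter


def projection_correct(err_px, threshold_px=5.0):
    """Assumed convention: a pose counts as correct if the mean error < 5 pixels."""
    return err_px < threshold_px


# Toy example: a 10 cm cube, 1 m from the camera, predicted pose off by 1 cm in x.
points = [[x, y, z] for x in (-0.05, 0.05) for y in (-0.05, 0.05) for z in (-0.05, 0.05)]
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
t_gt, t_pred = [0.0, 0.0, 1.0], [0.01, 0.0, 1.0]
diameter = math.dist([-0.05, -0.05, -0.05], [0.05, 0.05, 0.05])

add = add_error(points, I, t_gt, I, t_pred)
err = projection_error(points, K, I, t_gt, I, t_pred)
print(add_correct(add, diameter), projection_correct(err))
```

In this toy case the same 1 cm offset passes the assumed ADD test but fails the assumed 5-pixel test, which shows that the two measures are not interchangeable. The accuracy figures reported above are the fraction of test images whose pose passes each test.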
Finally, ablation experiments illustrate the effects of the hourglass and mask modules. Accuracy drops by 5.4% when the hourglass module is removed and by 2.3% when the mask module is removed. These results show that the proposed network design is key to improving the overall performance of pose estimation. Conclusion This paper proposes a mask segmentation and key point detection network that avoids heavy post-processing, maintains the speed of the algorithm, and improves the accuracy of pose estimation. The experimental results demonstrate that the method is efficient and outperforms other recent convolutional neural network (CNN)-based approaches, with a detection speed comparable to that of existing methods.
