Unsupervised monocular image depth estimation combined with local plane parameter prediction
Abstract
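Objective Unsupervised monocular image depth estimation is an important direction in 3D reconstruction and has broad application value in areas such as visual navigation and obstacle detection. To address the local differentiability problem of current mainstream methods, a method based on local plane parameter prediction is proposed. Method The depth estimation problem is converted into a local plane parameter estimation problem, and a local plane parameter prediction module replaces the upsampling and depth map generation steps of multi-scale estimation. The depth map prediction at each scale is restored to the standard scale according to the local plane parameters, and the standard-scale depth map is then obtained from the pinhole camera model. This avoids the local differentiability introduced by bilinear interpolation and thus effectively keeps training from falling into local minima. Together with a serial attention mechanism introduced in the skip connections of the network, it improves the feature extraction ability of the network. Result Comparative and ablation experiments were conducted on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) autonomous driving dataset against existing unsupervised methods and some supervised methods. Relative to the best reported figures, the error metrics decreased by 10% to 20% and the accuracy metrics improved by about 2%. The resulting dense depth maps also have clear edge contours and better robustness to reflective regions. Conclusion The proposed depth estimation method based on local plane parameter prediction makes full use of convolutional feature information, avoids local minima during training, and adds geometric constraints to the network, yielding better test metrics and visual results.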
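The step of restoring a coarse prediction to the standard scale through plane parameters and a pinhole camera model can be illustrated with a minimal sketch. The abstract does not give the exact parameterization, so the following is an assumption: each coarse cell predicts a plane (n1, n2, n3, d) in camera coordinates, and the depth of every full-resolution pixel in that cell is the intersection of its pinhole ray with the plane. The function name `planes_to_depth`, the block size `k`, and the intrinsics `fx, fy, cx, cy` are hypothetical names introduced for illustration only.

```python
def planes_to_depth(planes, k, fx, fy, cx, cy, eps=1e-6):
    """Expand per-cell plane parameters into a full-resolution depth map.

    planes[i][j] = (n1, n2, n3, d): assumed plane n . X = d for coarse cell (i, j).
    k: number of full-resolution pixels per coarse cell along each axis.
    Returns a nested list of depths with shape (h * k, w * k).
    """
    h, w = len(planes), len(planes[0])
    depth = []
    for y in range(h * k):
        ry = (y - cy) / fy  # pinhole ray direction, y component (z component is 1)
        row = []
        for x in range(w * k):
            rx = (x - cx) / fx  # pinhole ray direction, x component
            n1, n2, n3, d = planes[y // k][x // k]
            # A point on the ray is z * (rx, ry, 1); substituting into n . X = d
            # gives z = d / (n1*rx + n2*ry + n3). The clamp guards against
            # rays that are nearly parallel to the plane.
            denom = max(n1 * rx + n2 * ry + n3, eps)
            row.append(d / denom)
        depth.append(row)
    return depth
```

Under these assumptions the full-resolution depth is a closed-form function of the plane parameters wherever the denominator stays above the clamp, which is one way to read the claim that the module sidesteps bilinear upsampling. A fronto-parallel plane (0, 0, 1, d) yields a constant depth d over its whole block.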
Keywords
Unsupervised monocular image depth estimation based on the prediction of local plane parameters
Zhou Dake1,2, Tian Jing1, Yang Xin1
(1. College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China; 2. Jiangsu Key Laboratory of Internet of Things and Control Technologies, Nanjing 211100, China)
Abstract
Objective Scene depth information plays a vital role in many current research topics, such as 3D reconstruction, obstacle detection, and visual navigation. Obtaining dense and accurate depth information usually requires expensive equipment. Estimating depth from color images avoids this cost and therefore has a wider range of applications. Stereo matching is the traditional method for estimating depth from RGB images, but because it relies heavily on feature matching, it produces large errors in weakly textured regions. With the wide application of convolutional neural networks in image processing, depth estimation from monocular images has been widely investigated. However, monocular depth estimation is essentially an ill-posed problem because a single image lacks the depth cues provided by motion and stereo. Among the many existing approaches, unsupervised learning from binocular images needs no ground-truth depth data: it uses image reconstruction as the supervisory signal to train the depth estimation model. This approach has achieved major breakthroughs, although depth estimation still depends on geometric features. To improve accuracy, research has focused on how to use the information in the shallow features of the image effectively and how to add geometric constraints to the predicted output while preserving good convergence. In the commonly used multi-scale estimation, bilinear interpolation sampling is only locally differentiable, which easily traps the network in a local minimum and degrades training. A method based on local plane parameter prediction is proposed to address these problems. 
The method is applied to multi-scale prediction as a fully differentiable operation with geometric constraints, which constrains the depth map predictions at all scales to converge in the same direction. Method This study presents an unsupervised monocular depth estimation network based on local plane parameter prediction. The backbone is an encoder-decoder network composed of three parts: a ResNet50-based encoder, a decoder that introduces a serial dual attention mechanism in the skip connections, and a multi-scale prediction stage built on a local plane parameter estimation module. During training, the network estimates the depth of one image of a stereo pair, reconstructs the other view, and uses the real image of that view as supervision. The training set contains 22 600 images from the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset. The model is built on the PyTorch framework, and input images are 640×192 pixels during training. Training runs for 20 epochs on an NVIDIA GTX 2080 GPU. In the multi-scale prediction module, the depth estimation problem is converted into a local plane parameter estimation problem. The local plane parameter prediction module replaces the upsampling and depth map generation steps of multi-scale estimation. The depth map prediction at each scale is restored to the standard scale according to the local plane parameters, and the standard-scale depth map is then obtained from a pinhole camera model. This avoids the local differentiability introduced by bilinear interpolation and thus keeps training from falling into local minima. A serial attention mechanism is introduced in the skip connections of the network to obtain clear edge contour information. Result We compared our model with multiple unsupervised and supervised methods on the KITTI test dataset. 
The quantitative evaluation metrics are absolute relative error (Abs Rel), squared relative error (Sq Rel), linear root mean square error (RMSE), logarithmic root mean square error (RMSElog), and threshold accuracy δ. The dense depth maps produced by each method are also compared. The experimental results show that the proposed method performs well on all error and accuracy metrics. In the comparative tests, the error metrics are reduced by 10% to 20% in relative terms, and the accuracy metrics are increased by 1% to 2%. The generated depth maps have relatively clear outlines and separate the important depth values of pedestrians and vehicles from complex backgrounds. The method is also reasonably robust in reflective regions, which improves the quality of depth estimation. A series of ablation experiments on the test set demonstrates the effectiveness of the proposed algorithm. Conclusion In this study, we proposed a depth estimation method based on local plane parameter prediction. The proposed method makes full use of convolutional feature information, avoids local minima during training, and adds geometric constraints to the network, yielding strong test metrics and visual results.
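The five evaluation metrics named in the abstract can be written down in a few lines. The sketch below follows the definitions customary in monocular depth evaluation; the abstract itself does not state the δ thresholds, so the values 1.25, 1.25² and 1.25³ are an assumption, as are the function name `depth_metrics` and the input convention (flat sequences of positive depths over valid pixels only).

```python
import math


def depth_metrics(pred, gt):
    """Compute standard depth-estimation metrics.

    pred, gt: equal-length sequences of positive depths (valid pixels only).
    Returns a dict with Abs Rel, Sq Rel, RMSE, RMSE log and three delta accuracies.
    """
    n = len(gt)
    pairs = list(zip(pred, gt))
    # Error metrics: lower is better.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    # Threshold accuracy: fraction of pixels whose ratio max(p/g, g/p)
    # falls below 1.25**i (assumed thresholds). Higher is better.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = {
        f"delta_{i}": sum(r < 1.25 ** i for r in ratios) / n for i in (1, 2, 3)
    }
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        **deltas,
    }
```

For example, `depth_metrics([2.0, 4.0], [2.0, 5.0])` gives an Abs Rel of 0.1, since the first pixel is exact and the second is off by 1 out of 5. A perfect prediction drives all four error metrics to 0 and all three δ accuracies to 1.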
Keywords
unsupervised learning; monocular depth estimation; attention mechanism; local plane parameters prediction; local differentiability