A review of progress in monocular depth estimation technology

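Huang Jun1,2,3, Wang Cong1,2,3, Liu Yue1,2,3, Bi Tianteng1,2,3 (1. School of Optoelectronics, Beijing Institute of Technology, Beijing 100081; 2. Advanced Innovation Center for Future Visual Entertainment, Beijing Film Academy, Beijing 100088; 3. China Academy of Industrial Internet, Beijing 100846)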

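Abstract
Depth estimation from a single image is a classical problem in computer vision and is important for 3D scene reconstruction and for occlusion and illumination handling in augmented reality. This paper reviews work on single-image depth estimation and introduces the commonly used datasets and methods. By scene type, datasets can be divided into indoor, outdoor, and virtual-scene datasets. By mathematical model, monocular depth estimation methods can be divided into traditional machine learning-based methods and deep learning-based methods. Traditional machine learning-based methods generally use a Markov random field (MRF) or conditional random field (CRF) to model depth relationships and, within a maximum a posteriori (MAP) framework, solve for depth by minimizing an energy function. Depending on whether the model contains parameters, these methods can be further divided into parametric and non-parametric learning methods: the former assume that the model contains unknown parameters, and training amounts to solving for them; the latter infer depth by similarity retrieval over existing datasets and do not need to learn any parameters. For deep learning-based monocular depth estimation, this paper describes in detail the state of research in China and abroad, together with its strengths and weaknesses, and classifies the methods level by level, bottom-up, according to different criteria. The first level distinguishes single-task methods that predict only depth from multi-task methods that simultaneously predict depth and other information such as semantics. The depth and semantics of an image are closely related, so some work studies joint multi-task prediction. The second level distinguishes absolute depth prediction from relative depth prediction. Absolute depth is the actual distance from objects in the scene to the camera, whereas relative depth concerns which objects in the image are nearer or farther. Given an arbitrary image, human vision is better at judging the relative distances of objects in the scene. The third level comprises supervised regression methods, supervised classification methods, and unsupervised methods. For single-image depth estimation, most work focuses on predicting absolute depth, and most early methods use supervised regression models, that is, the training data are labeled and the model regresses continuous depth values. Because depth in a scene ranges from far to near, some methods instead treat depth estimation as a classification problem. Supervised learning requires every RGB image to have a corresponding depth label, and collecting depth labels usually requires a depth camera or LiDAR; the former has limited range and the latter is expensive. Moreover, the raw depth labels collected are usually sparse points that do not align well with the original image. Unsupervised estimation methods that need no depth labels are therefore a research trend; their basic idea is to use left and right views and to combine epipolar geometry with the idea of an autoencoder to solve for depth.
Keywords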
The progress of monocular depth estimation technology

Huang Jun1,2,3, Wang Cong1,2,3, Liu Yue1,2,3, Bi Tianteng1,2,3(1.School of Optoelectronics, Beijing Institute of Technology, Beijing 100081, China;2.Advanced Innovation Center for Future Visual Entertainment, Beijing Film Academy, Beijing 100088, China;3.China Academy of Industrial Internet, Beijing 100846, China)

Abstract
Depth estimation from a single image, a classical problem in computer vision, is important for 3D scene reconstruction and for occlusion and illumination handling in augmented reality. This paper reviews the recent literature on single-image depth estimation and introduces the commonly used datasets and methods. By scene type, the datasets can be divided into indoor, outdoor, and virtual-scene datasets. By mathematical model, monocular depth estimation methods can be divided into traditional machine learning-based methods and deep learning-based methods. Traditional machine learning-based methods use a Markov random field (MRF) or conditional random field (CRF) to model the depth relationships of pixels in an image. Within a maximum a posteriori (MAP) framework, the depth is obtained by minimizing an energy function. Depending on whether the model contains parameters, traditional machine learning-based methods can be further divided into parametric and non-parametric learning methods. The former assume that the model contains unknown parameters, and training solves for these parameters. The latter infer depth by similarity retrieval over existing datasets, so no parameters need to be learned. In recent years, deep learning has advanced many areas of computer vision. The state of research on deep learning-based monocular depth estimation in China and abroad is analyzed, together with the advantages and disadvantages of the methods. These methods are classified hierarchically, in a bottom-up manner, according to different classification criteria. At the first level, monocular depth estimation methods are divided into single-task methods that predict only depth and multi-task methods that simultaneously predict depth and other information such as semantics. The depth and semantics of an image are closely related, so several works focus on multi-task joint prediction.
The second level contains absolute depth prediction methods and relative depth prediction methods. Absolute depth refers to the actual distance between an object in the scene and the camera, while relative depth concerns which objects in the image are nearer or farther. Given an arbitrary image, people are generally better at judging the relative distances of objects in the scene. The third level consists of supervised regression methods, supervised classification methods, and unsupervised methods. For the single-image depth estimation task, most works focus on predicting absolute depth, and most early methods use a supervised regression model, in which the training data carry depth labels and the model regresses continuous depth values. Because depth in a scene ranges from far to near, several studies instead treat depth estimation as a classification problem. Supervised learning methods require each RGB image to have a corresponding depth label, whose acquisition usually requires a depth camera or LiDAR. However, the depth camera is limited in range, and LiDAR is expensive. Furthermore, the raw depth labels collected are usually sparse points and cannot be precisely matched to the original image. Therefore, unsupervised depth estimation methods that need no depth labels have become a research trend in recent years. The basic idea is to use left and right views and to combine epipolar geometry with the idea of an autoencoder to obtain depth.
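The unsupervised idea mentioned at the end of the abstract can be illustrated with a minimal NumPy sketch. It is not the method of any particular paper surveyed here. It assumes a rectified grayscale stereo pair, so that a left-image pixel at column x appears in the right image at column x - d, where d is the disparity; the function names, the focal length, and the baseline are hypothetical values chosen for illustration. A candidate disparity map is scored by how well the right view, warped by that disparity, reconstructs the left view (the autoencoder-style reconstruction signal), and disparity is then converted to depth through the standard relation depth = focal length × baseline / disparity.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right view at column x - d.

    Assumes a rectified pair, so corresponding pixels share the same row.
    Uses linear interpolation along the row and clamps at the image border.
    """
    h, w = disparity.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity  # source columns
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_loss(left, right, disparity):
    """Mean absolute error between the left view and its reconstruction."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disparity))))

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Depth in metres from disparity in pixels (depth = f * B / d)."""
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Synthetic example: the left view is the right view shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((8, 32))
true_d = 3
left = np.empty_like(right)
left[:, true_d:] = right[:, :-true_d]
left[:, :true_d] = right[:, :1]  # border columns match the clamped sampling

good = photometric_loss(left, right, np.full(right.shape, float(true_d)))
bad = photometric_loss(left, right, np.zeros(right.shape))
depth = disparity_to_depth(np.array([3.0]), focal_px=720.0, baseline_m=0.5)
print(good, bad, depth)
```

In this toy setting the correct disparity yields a lower reconstruction error than a wrong one, which is the signal an unsupervised network would be trained on in place of depth labels. A real system would predict the disparity with a network and would typically add smoothness and left-right consistency terms, which are omitted here.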
Keywords
