Recent progress in 3D vision

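Long Xiaoxiao1,2, Cheng Xinjing2, Zhu Hao3,2, Zhang Pengju4,5, Liu Haomin6, Li Jun7, Zheng Lintao7, Hu Qingyong8, Liu Hao9, Cao Xun3, Yang Ruigang2, Wu Yihong4,5, Zhang Guofeng10, Liu Yebin11, Xu Kai7, Guo Yulan7, Chen Baoquan12 (1. The University of Hong Kong, Hong Kong 999077; 2. Jiluo Technology (Shanghai) Co., Ltd., Shanghai 200000; 3. Nanjing University, Nanjing 210023; 4. Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 5. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190; 6. SenseTime Research Institute, Hangzhou 311215; 7. National University of Defense Technology, Changsha 410073; 8. University of Oxford, Oxford OX13QR; 9. Sun Yat-sen University, Guangzhou 510275; 10. Zhejiang University, Hangzhou 310058; 11. Tsinghua University, Beijing 100085; 12. Peking University, Beijing 100871)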

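Abstract
Driven by applications such as autonomous driving, robotics, digital cities, and virtual/mixed reality, 3D vision has attracted wide attention. Research in 3D vision centers on tasks such as depth image acquisition, visual localization and mapping, 3D modeling, and 3D understanding. This paper reviews and compares research progress in China and abroad on these tasks. First, for depth image acquisition, progress in stereo matching is reviewed from three aspects (non-end-to-end, end-to-end, and unsupervised stereo matching), and progress in monocular depth estimation is reviewed from two aspects (depth regression networks and depth completion networks). Second, for visual localization and mapping, progress in visual localization in large-scale scenes is reviewed from two aspects (end-to-end and non-end-to-end visual localization), and progress in simultaneous localization and mapping (SLAM) is reviewed from two aspects (visual SLAM and SLAM fused with other sensors). Third, for 3D modeling, progress in 3D geometric modeling is reviewed from four aspects (deep 3D representation learning, deep 3D generative models, structured representation learning and generative models, and deep learning-based 3D reconstruction), and progress in dynamic human modeling is reviewed from three aspects (multiview RGB reconstruction, single- and multiple-depth-camera methods, and single-view RGB methods). Finally, for 3D understanding, progress in point cloud semantic understanding is reviewed from two aspects (semantic segmentation and instance segmentation of point clouds). On this basis, future trends in 3D vision research are outlined to provide a reference for researchers in the field.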
Recent progress in 3D vision

Long Xiaoxiao1,2, Cheng Xinjing2, Zhu Hao3,2, Zhang Pengju4,5, Liu Haomin6, Li Jun7, Zheng Lintao7, Hu Qingyong8, Liu Hao9, Cao Xun3, Yang Ruigang2, Wu Yihong4,5, Zhang Guofeng10, Liu Yebin11, Xu Kai7, Guo Yulan7, Chen Baoquan12 (1. The University of Hong Kong, Hong Kong 999077, China; 2. Jiluo Technology (Shanghai) Co., Ltd., Shanghai 200000, China; 3. Nanjing University, Nanjing 210023, China; 4. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 5. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China; 6. SenseTime Research Institute, Hangzhou 311215, China; 7. National University of Defense Technology, Changsha 410073, China; 8. University of Oxford, Oxford OX1 3QR, United Kingdom; 9. Sun Yat-sen University, Guangzhou 510275, China; 10. Zhejiang University, Hangzhou 310058, China; 11. Tsinghua University, Beijing 100085, China; 12. Peking University, Beijing 100871, China)

Abstract
3D vision has numerous applications in areas such as autonomous vehicles, robotics, digital cities, virtual/mixed reality, human-machine interaction, entertainment, and sports. It covers a broad variety of research topics, ranging from 3D data acquisition, 3D modeling, shape analysis, and rendering to interaction. With the rapid development of 3D acquisition sensors (such as low-cost LiDARs, depth cameras, and 3D scanners), 3D data have become increasingly accessible. Moreover, advances in deep learning techniques have further boosted the development of 3D vision, with a large number of algorithms proposed recently. We provide a comprehensive review of the progress of 3D vision algorithms over the past few years, mostly in the last year. This survey covers seven topics: stereo matching, monocular depth estimation, visual localization in large-scale scenes, simultaneous localization and mapping (SLAM), 3D geometric modeling, dynamic human modeling, and point cloud understanding. Although several surveys are now available in the area of 3D vision, this survey differs from them in three aspects. First, this study covers a wide range of topics in 3D vision and can therefore benefit a broad research community. In contrast, most existing works focus on a specific topic, such as depth estimation or point cloud learning. Second, this study mainly focuses on the progress in very recent years and can therefore provide readers with up-to-date information. Third, this paper presents a direct comparison between the progress in China and abroad. The recent progress in depth image acquisition, including stereo matching and monocular depth estimation, is reviewed first. The stereo matching algorithms are divided into non-end-to-end, end-to-end, and unsupervised stereo matching algorithms. The monocular depth estimation algorithms are categorized into depth regression networks and depth completion networks. The depth regression networks are further divided into encoder-decoder networks and composite networks. Then, the recent progress in visual localization, including visual localization in large-scale scenes and SLAM, is reviewed. The visual localization algorithms for large-scale scenes are divided into end-to-end and non-end-to-end algorithms, and the non-end-to-end algorithms are further categorized into deep learning-based feature description algorithms, 2D image retrieval-based visual localization algorithms, 2D-3D matching-based visual localization algorithms, and visual localization algorithms based on the fusion of 2D image retrieval and 2D-3D matching. SLAM algorithms are divided into visual SLAM algorithms and multisensor fusion-based SLAM algorithms. Next, the recent progress in 3D modeling and understanding, including 3D geometric modeling, dynamic human modeling, and point cloud understanding, is reviewed. 3D geometric modeling algorithms are grouped into deep 3D representation learning, deep 3D generative models, structured representation learning and generative models, and deep learning-based 3D modeling. Dynamic human modeling algorithms are divided into multiview RGB modeling algorithms, single-depth camera-based and multiple-depth camera-based algorithms, and single-view RGB modeling methods. Point cloud understanding algorithms are categorized into semantic segmentation methods and instance segmentation methods for point clouds. The paper is organized as follows. Section 1 presents the progress in 3D vision outside China. Section 2 introduces the progress of 3D vision in China. Section 3 compares and analyzes the 3D vision techniques developed in China and abroad. Section 4 points out several future research directions in the area.
