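Basic framework and key algorithms of parallel vision
Zhang Hui1,2, Li Xuan3,4, Wang Feiyue1 (1. State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 2. Tencent Technology (Beijing) Co., Ltd., Beijing 100080; 3. Peng Cheng Laboratory, Shenzhen 518055; 4. Beijing Institute of Technology, Beijing 100081)
Abstract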
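Objective With the rapid development of computers and artificial intelligence, visual perception technology has advanced by leaps and bounds. However, visual perception methods based mainly on deep learning depend on large-scale, diverse datasets. This paper therefore proposes parallel vision, a visual analysis framework based on parallel learning, which supplies vision algorithms with sufficient image data through large numbers of finely annotated artificial images and thereby turns the computer into a “laboratory” of computational intelligence. Method First, the artificial image system simulates the imaging conditions that may occur in actual images and uses the system's internal parameters to obtain annotation information automatically, yielding artificial image data that meet the requirements. Then, predictive learning is used to design the visual perception model, and the computational experiment method is used to run various experiments on the large volume of image data generated by the artificial image system. This makes it convenient to study how difficult scenes, such as complex environmental conditions, affect the visual perception model; some factors that are uncontrollable in practice become controllable, and the interpretability of the vision model increases. Finally, prescriptive learning feeds back to optimize the model parameters: the difficulties that the visual perception model encounters in actual scenes guide its training in artificial scenes, and the model is learned and optimized online through virtual-real interaction between the actual and the artificial. Because many researchers have already worked on building artificial scenes and generating large numbers of virtual images, this paper adopts images from these existing artificial scenes, augments the actual-scene images by flipping, cropping, and scaling, and then carries out an application case study focused on computational experiments and predictive learning. Result Extensive experiments on the SYNTHIA (synthetic collection of imagery and annotations), Virtual KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute), and VIPER (visual perception benchmark) datasets show that the proposed method effectively overcomes the effect of dataset distribution differences on model generalization and outperforms the best contemporaneous methods; for example, on the SYNTHIA dataset, detection and segmentation performance improve by 3.8% and 2.7%, respectively. Conclusion Parallel vision is an important research direction in the field of visual computing. Combined with deep learning, it will drive more and more intelligent vision systems to mature and enter application.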
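The flipping, cropping, and scaling augmentation applied to the actual-scene images can be sketched as follows. This is only an illustration, not the authors' pipeline: the function names (`hflip`, `crop`, `scale`), the `[x1, y1, x2, y2]` box layout, and the nearest-neighbour resize are all assumptions chosen to keep the example dependency-free apart from NumPy.

```python
import numpy as np

# Hypothetical helpers: boxes are float arrays of shape (N, 4) laid out as
# [x1, y1, x2, y2] in pixel coordinates; images are H x W x C arrays.

def hflip(image, boxes):
    """Flip the image left-right and mirror the boxes accordingly."""
    width = image.shape[1]
    out = boxes.copy()
    out[:, 0] = width - boxes[:, 2]   # new left edge comes from old right edge
    out[:, 2] = width - boxes[:, 0]   # new right edge comes from old left edge
    return image[:, ::-1], out

def crop(image, boxes, x0, y0, crop_w, crop_h):
    """Crop a window, shift boxes into it, clip them, drop emptied boxes."""
    cropped = image[y0:y0 + crop_h, x0:x0 + crop_w]
    out = boxes - np.array([x0, y0, x0, y0], dtype=float)
    out[:, [0, 2]] = np.clip(out[:, [0, 2]], 0, crop_w)
    out[:, [1, 3]] = np.clip(out[:, [1, 3]], 0, crop_h)
    keep = (out[:, 2] > out[:, 0]) & (out[:, 3] > out[:, 1])
    return cropped, out[keep]

def scale(image, boxes, factor):
    """Nearest-neighbour resize by `factor`; boxes scale by the same factor."""
    h, w = image.shape[:2]
    new_h = max(1, int(round(h * factor)))
    new_w = max(1, int(round(w * factor)))
    rows = np.minimum((np.arange(new_h) / factor).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / factor).astype(int), w - 1)
    return image[rows][:, cols], boxes * factor

# Example usage on a tiny synthetic image with one box.
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)
boxes = np.array([[1.0, 1.0, 3.0, 2.0]])
flipped_img, flipped_boxes = hflip(image, boxes)
cropped_img, cropped_boxes = crop(image, boxes, 2, 0, 3, 3)
scaled_img, scaled_boxes = scale(image, boxes, 2.0)
```

The point of the sketch is that every geometric change to the image has to be applied to its annotations as well; otherwise the augmented real images would carry labels that no longer match their pixels.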
Keywords
The basic framework and key algorithms of parallel vision
Zhang Hui1,2, Li Xuan3,4, Wang Feiyue1 (1. The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 2. Tencent Technology (Beijing) Company, Limited, Beijing 100080, China; 3. Peng Cheng Laboratory, Shenzhen 518055, China; 4. Beijing Institute of Technology, Beijing 100081, China)
Abstract
Objective Computer vision uses cameras and computers as the “eyes” of a machine, giving it the abilities of segmentation, classification, recognition, tracking, and decision-making. In recent years, computer vision technology has been widely used in intelligent transportation, unmanned driving, robot navigation, intelligent video monitoring, and many other fields. The camera has become the most commonly used sensing device in automatic driving and smart cities, generating massive image and video data. Only computer vision technology makes real-time analysis and processing of these data feasible: it can detect all kinds of objects in real time and accurately obtain their positions and motion states from images and video. However, actual scenes are highly complex, and many interwoven factors pose a great challenge to visual computing systems. At present, computer vision technology is mainly based on deep learning, which is driven by large-scale data and whose training depends heavily on datasets, so sufficient data are needed. However, collecting and labeling large-scale image data from actual scenes is time-consuming and labor-intensive, and usually only small-scale image data of limited diversity can be obtained. For example, Microsoft common objects in context (MS COCO), a popular dataset for instance segmentation tasks, contains about 300 000 images, mainly in 91 categories. A dataset of this size can hardly express the complexity of reality or simulate real situations. A model trained on such a limited dataset lacks practical significance, because the dataset is not large enough to represent the real data distribution and cannot guarantee effectiveness in practical applications.
Method The theory of social computing and parallel systems is proposed based on artificial systems, computational experiments, and parallel execution (ACP). The ACP methodology plays an essential role in the modeling and control of complex systems. A virtual artificial society is constructed to connect the virtual and the real worlds through parallel management. On the basis of existing facts, an artificial system models the behavior of a complex system; computational experiments are then used to analyze that behavior, and interaction with reality is used to obtain a system that operates better than the real one. To address the bottleneck of deep learning in computer vision, this paper proposes parallel vision, a visual analysis framework based on parallel learning. Parallel vision is an intelligent visual perception framework that extends the ACP methodology to the computer vision field. Within this framework, large-scale realistic artificial images can be obtained easily, supplying vision algorithms with enough well-labeled image data. In this way, the computer can be turned into a “laboratory” of computational intelligence. First, the artificial image system simulates the imaging conditions that may appear in actual images, uses the internal parameters of the system to obtain annotation information automatically, and produces the required artificial images. Then, we use predictive learning to design the visual perception model and use the computational experiment method to conduct various experiments on the rich supply of image data generated by the artificial image system. This makes it convenient to study how difficult scenes, such as complex environmental conditions, affect the visual perception model; some factors that are uncontrollable in practice become controllable, and the interpretability of the visual model increases. Finally, we use prescriptive learning to optimize the model parameters. The difficulties that the visual perception model encounters in actual scenes guide its training in artificial scenes, and we learn and optimize the model online through virtual-real interaction. This paper also presents an application case study that preliminarily demonstrates the effectiveness of the proposed framework. The case study works with synthetic images that have accurate annotations and real images that have no labels. The virtual-real interaction guides the model to learn useful information from synthetic data while remaining consistent with real data. We first analyze the data distribution discrepancy from a probabilistic perspective and divide it into image-level and instance-level discrepancies. We then design two components to align these discrepancies, namely global-level alignment and local-level alignment. Furthermore, a consistency alignment component encourages consistency between the global-level and local-level alignment components.
Result We evaluate the proposed approach on the real Cityscapes dataset by adapting from the virtual SYNTHIA (synthetic collection of imagery and annotations), Virtual KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute), and VIPER (visual perception benchmark) datasets. Experimental results demonstrate that it outperforms state-of-the-art methods; for example, on the SYNTHIA dataset, detection and segmentation performance improve by 3.8% and 2.7%, respectively.
Conclusion Parallel vision is an important research direction in the field of visual computing. Combined with deep learning, it will help more and more intelligent vision systems to be developed and applied.
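The automatic annotation step in the Method part, where the artificial image system derives labels from its own internal parameters, can be illustrated with a small sketch. The abstract does not specify how this is done; the example below assumes a pinhole camera with known intrinsics `K` and pose `(R, t)` and a scene object described by the eight corners of its 3D bounding box. The function name `project_box` and all numeric values are hypothetical.

```python
import numpy as np

def project_box(corners_world, K, R, t, width, height):
    """Project 3D box corners (8 x 3, world frame) to a 2D pixel box.

    Returns [x1, y1, x2, y2] clipped to the image, or None when the object
    is (partly) behind the camera or falls entirely outside the image.
    Assumes an ideal pinhole camera without lens distortion.
    """
    cam = corners_world @ R.T + t          # world -> camera coordinates
    if np.any(cam[:, 2] <= 0):
        return None                        # simplification: skip such objects
    pix = cam @ K.T                        # apply intrinsics
    pix = pix[:, :2] / pix[:, 2:3]         # perspective divide
    x1, y1 = pix.min(axis=0)
    x2, y2 = pix.max(axis=0)
    box = np.array([
        np.clip(x1, 0, width), np.clip(y1, 0, height),
        np.clip(x2, 0, width), np.clip(y2, 0, height),
    ])
    if box[2] <= box[0] or box[3] <= box[1]:
        return None                        # nothing visible after clipping
    return box

# Example: a 2 x 2 x 2 cube centred 5 units in front of the camera.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
corners = np.array([[x, y, z]
                    for x in (-1.0, 1.0)
                    for y in (-1.0, 1.0)
                    for z in (4.0, 6.0)])
label = project_box(corners, K, R, t, width=100, height=100)
behind = project_box(corners, K, R, np.array([0.0, 0.0, -10.0]), 100, 100)
```

Because the simulator knows the object geometry and camera parameters exactly, a label produced this way needs no manual annotation, which is the property the framework relies on to obtain large volumes of well-labeled artificial images.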
Keywords
computer vision; parallel learning; parallel vision; visual perception model; instance segmentation; object detection