3D object detection and classification combined with point cloud depth information

Zhou Hao1, Qi Honggang1, Deng Yongqiang2, Li Juanjuan2, Liang Hao2, Miao Jun3 (1. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China; 2. VanJee Technology Co., Ltd., Beijing 100080, China; 3. School of Computer Science, Beijing Information Science and Technology University, Beijing 100192, China)

Abstract
Objective 3D object detection based on point clouds is one of the key technologies in autonomous driving. Because of the unstructured nature of point clouds, the point cloud is usually voxelized, and the 3D object detection task is then completed on the voxel features. In voxel-based 3D object detection algorithms, voxelizing the point cloud causes part of the point cloud's data information and structural information to be lost, which degrades detection performance. To address this problem, this paper proposes a method that fuses point cloud depth information and effectively improves the accuracy of 3D object detection. Method First, the point cloud is converted into a depth image by spherical projection. The depth image is then fused with the feature map extracted by the 3D object detection algorithm to compensate for the lost information. Because the fused features at this stage are represented as a 2D pseudo-image, the backbone network of YOLOv7 (you only look once v7) is used to extract the fused features. Finally, a regression and classification network is designed, and the extracted fused features are fed into it to predict the position, size, and category of objects. Result The proposed method is evaluated on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset and the DAIR-V2X dataset. Using average precision (AP) as the evaluation metric, on the KITTI dataset the improved algorithm PP-Depth achieves gains of 0.84%, 2.3%, and 1.77% over PointPillars on the car, pedestrian, and bicycle categories, respectively. Taking the easy difficulty of the bicycle category as an example, the improved algorithm PP-YOLO-Depth achieves gains of 5.15%, 1.1%, and 2.75% over PointPillars, PP-YOLO, and PP-Depth, respectively. On the DAIR-V2X dataset, PP-Depth achieves gains of 17.46%, 20.72%, and 12.7% over PointPillars on the car, pedestrian, and bicycle categories. Taking the easy difficulty of the car category as an example, PP-YOLO-Depth achieves gains of 13.53%, 5.59%, and 1.08% over PointPillars, PP-YOLO, and PP-Depth, respectively. Conclusion The proposed method performs well on both the KITTI and DAIR-V2X datasets. It reduces the information loss of the point cloud during voxelization, improves the network's ability to extract fused features and detect multi-scale objects, and thus yields more accurate detection results.
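The spherical-projection step described above can be illustrated with a minimal NumPy sketch. The image size and vertical field of view used here are illustrative assumptions, not the paper's settings; the sketch simply maps each point (x, y, z) to an azimuth/elevation pixel and stores its range as the gray value.

```python
import numpy as np

def point_cloud_to_depth_image(points, h=64, w=1024,
                               fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an N x 3 (or N x 4) point cloud onto a spherical depth image.

    h, w and the vertical field of view are assumed values for illustration.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)      # range of each point
    valid = depth > 1e-6
    x, y, z, depth = x[valid], y[valid], z[valid], depth[valid]

    yaw = np.arctan2(y, x)                             # azimuth angle
    pitch = np.arcsin(z / depth)                       # elevation angle

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down

    # Normalize the angles to pixel coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * w                  # column index
    v = (1.0 - (pitch - fov_down) / fov) * h           # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    # Keep the closest point per pixel: write far points first, near points last.
    order = np.argsort(depth)[::-1]
    depth_image = np.zeros((h, w), dtype=np.float32)
    depth_image[v[order], u[order]] = depth[order]
    return depth_image
```

The resulting grayscale image can then be normalized and used as the fusion input described in the abstract.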
Keywords
3D object detection and classification combined with point cloud depth information

Zhou Hao1, Qi Honggang1, Deng Yongqiang2, Li Juanjuan2, Liang Hao2, Miao Jun3(1.School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China;2.VanJee Technology Co., Ltd., Beijing 100080, China;3.School of Computer Science, Beijing Information Science and Technology University, Beijing 100192, China)

Abstract
Objective Perception systems are integral components of modern autonomous driving systems. They are designed to accurately estimate the state of the surrounding environment and provide reliable observations for prediction and planning. 3D object detection intelligently predicts the location, size, and category of the key 3D objects near the autonomous vehicle and is an important part of the perception system. In 3D object detection, the common data types are images and point clouds. Unlike an image, a point cloud is a set of points in 3D space, where the position of each point is represented by coordinates in a 3D coordinate system and is usually accompanied by additional information such as reflection intensity. In computer vision, point clouds are often used to represent the shape and structure of 3D objects. Therefore, point cloud based 3D object detection methods use more real spatial information and often have advantages in detection accuracy and speed. However, because of the unstructured nature of the point cloud, it is usually converted into a 3D voxel grid, each voxel of which is regarded as a 3D feature vector; a 3D convolutional network then extracts the voxel features, and the 3D object detection task is completed on these features. In voxel-based 3D object detection algorithms, voxelizing the point cloud leads to the loss of data information and structural information of part of the point cloud, which degrades detection performance. We propose a method that fuses point cloud depth information to solve this problem. Our method uses the depth information of the point cloud as fusion information to compensate for the information lost during voxelization. It also uses the efficient YOLOv7-Net network to extract the fused features, which improves the feature extraction capability and the detection performance for multi-scale objects and effectively increases the accuracy of 3D object detection. Method The point cloud is first converted into a depth image through spherical projection to reduce the information loss of the point cloud during voxelization. The depth image is a grayscale image generated from the point cloud that reflects the distance from each point to the origin of the coordinate system in 3D space, with the pixel gray value representing the depth information of the point cloud. The depth image therefore provides a rich feature representation of the point cloud, and its depth information can be used as fusion information to compensate for the information lost during voxelization. The depth image is then fused with the feature map extracted by the 3D object detection algorithm to compensate for the lost information. Given that the fused features at this stage take the form of 2D pseudo-images, a more efficient backbone feature extraction network is chosen to extract them. The backbone feature extraction network of YOLOv7 uses an adaptive convolution module, which adaptively adjusts the convolution kernel size and the receptive field according to the object scale and thus improves the detection performance of the network for multi-scale objects. At the same time, the feature fusion module and feature pyramid pooling module of YOLOv7-Net further enhance the feature extraction ability and detection performance of the network. Therefore, we choose YOLOv7-Net to extract the fused features. Finally, a classification and regression network is designed, and the extracted fused features are fed into it to predict the category, position, and size of objects. Result Our method is tested on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) 3D object detection dataset and the DAIR-V2X object detection dataset. Using average precision (AP) as the evaluation metric, PP-Depth improves on PointPillars by 0.84%, 2.3%, and 1.77% in the car, pedestrian, and bicycle categories on the KITTI dataset. Taking the easy difficulty of the bicycle category as an example, PP-YOLO-Depth improves on PointPillars, PP-YOLO, and PP-Depth by 5.15%, 1.1%, and 2.75%, respectively. On the DAIR-V2X dataset, PP-Depth improves on PointPillars by 17.46%, 20.72%, and 12.7% in the car, pedestrian, and bicycle categories. Taking the easy difficulty of the car category as an example, PP-YOLO-Depth improves on PointPillars, PP-YOLO, and PP-Depth by 13.53%, 5.59%, and 1.08%, respectively. Conclusion Experimental results show that our method achieves good performance on both the KITTI 3D object detection dataset and the DAIR-V2X object detection dataset. It reduces the information loss of the point cloud during voxelization and improves the network's ability to extract fused features and detect multi-scale objects, thereby producing more accurate object detection results.
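The abstract states that the depth image is fused with the feature map of the 3D detector before the backbone, but does not specify the fusion operator. The PyTorch sketch below is one plausible realization under assumptions of our own: the depth image is resized to the pseudo-image resolution, concatenated as an extra channel, and mixed with a 1x1 convolution; the class and module names are hypothetical and the backbone itself is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthFusion(nn.Module):
    """Sketch of fusing a point cloud depth image with the 2D pseudo-image
    features produced by a pillar/voxel encoder.

    Assumptions (not specified in the abstract): fusion is done by resizing
    the depth image to the pseudo-image resolution, concatenating it as an
    extra channel, and mixing with a 1x1 convolution before the backbone.
    """

    def __init__(self, pseudo_channels=64):
        super().__init__()
        self.mix = nn.Conv2d(pseudo_channels + 1, pseudo_channels, kernel_size=1)

    def forward(self, pseudo_image, depth_image):
        # pseudo_image: (B, C, H, W) features from the voxel/pillar encoder
        # depth_image:  (B, 1, H', W') spherical-projection depth image
        depth = F.interpolate(depth_image, size=pseudo_image.shape[-2:],
                              mode="bilinear", align_corners=False)
        fused = torch.cat([pseudo_image, depth], dim=1)   # channel concatenation
        return self.mix(fused)                            # back to C channels

# The fused map would then be passed to the 2D backbone (YOLOv7's backbone in
# the paper) and on to the classification and regression heads.
fusion = DepthFusion(pseudo_channels=64)
pseudo = torch.randn(2, 64, 248, 216)   # PointPillars-style pseudo-image (illustrative size)
depth = torch.randn(2, 1, 64, 1024)     # range-view depth image (illustrative size)
features = fusion(pseudo, depth)        # -> (2, 64, 248, 216)
```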
Keywords
