3D object detection combining mixed domain attention and dilated convolution

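Yan Juan, Fang Zhijun, Gao Yongbin (Department of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620)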

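Abstract
Objective Methods that perform 3D object detection with deep convolutional neural networks have made great progress, but the features extracted by a convolutional neural network lack the dependencies between different regions as well as the dependencies between different channels, and it is difficult to expand the receptive field without losing spatial resolution. To address these shortcomings, a 3D object detection method that combines mixed domain attention and dilated convolution is proposed. Method A spatial domain attention mechanism is integrated into the input layer to transform the spatial position of the input information and preserve the regional features that need the most attention. A channel domain attention mechanism is integrated into the network to extract the channel weights of the features and obtain the key channel features. Fusing the spatial domain and channel domain attention mechanisms applies mixed spatial and channel attention to the features. A network layer that combines dilated convolution with the channel attention mechanism is integrated into the output layer of the feature extractor; it expands the receptive field without losing spatial resolution, extracts the channel weights of the features for the different receptive fields, and fuses them to obtain the key channel features of the global receptive field. A feature pyramid structure is introduced to construct the feature extractor and extract high-resolution feature maps, which greatly improves the detection performance of the network. A two-stage region proposal network is used to regress more accurately localized 3D bounding boxes. Result Experimental results on the KITTI (a project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset show that, as the degree of occlusion increases from slight to severe, for the car class in the test set the average precision of 3D detection boxes (AP3D) is 83.45%, 74.29%, and 67.92%, and the average precision of 2D detection boxes in the bird's-eye view (APBEV) is 89.61%, 87.05%, and 79.69%; for the pedestrian and cyclist classes, the AP3D and APBEV values likewise show some advantage over the detection results of other methods. Conclusion The proposed 3D object detection network alleviates, to some extent, the lack of visual attention in the features that convolutional neural networks extract in 3D detection tasks, so that 3D object detection can be applied more effectively to outdoor autonomous driving.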
Keywords
3D object detection based on domain attention and dilated convolution

Yan Juan, Fang Zhijun, Gao Yongbin (Department of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China)

Abstract
Objective Convolutional neural networks (CNNs) have advanced rapidly in recent years, and 3D object detection networks based on deep learning have advanced with them. 3D object detection aims to identify the class, location, orientation, and size of a target object in 3D space, and it is widely used in vision applications such as autonomous driving, intelligent monitoring, and medical analysis. The features extracted by a deep learning network strongly affect detection accuracy. Like human vision, a detection task must distinguish objects from the background: humans attend to target objects and disregard the background. An object detector should therefore pay more attention to target regions and less attention to background regions in an image. However, a CNN does not distinguish which regions and channels of an image deserve more or less attention. The features it extracts thus lack the dependencies between different regions as well as the dependencies between different channels. Current 3D object detection methods based on deep learning place pooling layers after stacks of convolution layers, generally applying max or average pooling to the feature maps to adjust the receptive field of the extracted features. However, pooling changes the receptive field by discarding some information, which causes a considerable loss of feature information and may lead to detection errors. A CNN should therefore expand the receptive field without losing information in order to obtain good detection results.
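The trade-off described in the Objective can be made concrete with a toy example. The sketch below is plain Python and is not the authors' implementation; the helper names (`effective_kernel`, `max_pool_1d`, `dilated_conv_1d`) and the 1D signal are assumptions chosen only for illustration. It contrasts pooling, which widens the receptive field by shrinking the feature map, with dilated convolution (the operation named in the title), which widens the span covered by a kernel while keeping the output length.

```python
def effective_kernel(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def max_pool_1d(signal, window=2):
    """Non-overlapping max pooling: the output is shorter than the input."""
    return [
        max(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, window)
    ]


def dilated_conv_1d(signal, kernel, dilation=1):
    """Zero-padded, stride-1 dilated convolution (cross-correlation form).

    The output has the same length as the input. Only odd kernel sizes are
    supported so that the padding is symmetric.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch only supports odd kernel sizes")
    pad = (effective_kernel(len(kernel), dilation) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


# Toy input (illustrative values only).
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

pooled = max_pool_1d(signal)                             # resolution is halved
dilated = dilated_conv_1d(signal, [1.0, 1.0, 1.0], 2)    # resolution is kept

print(len(signal), len(pooled), len(dilated))
print(effective_kernel(3, 1), effective_kernel(3, 2), effective_kernel(3, 4))
```

With a 3-tap kernel, raising the dilation from 1 to 2 to 4 widens the span from 3 to 5 to 9 input samples without adding weights or reducing the output length, whereas each pooling step halves the number of positions.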
To address these shortcomings of existing 3D object detection methods, this study proposes a two-stage 3D object detection network that combines mixed domain attention and dilated convolution. Method A 3D object detection network based on deep learning is built. First, a spatial domain attention mechanism is integrated into the input layer of the network; it transforms the spatial position of the input information and preserves the regional features that require more attention. A channel domain attention mechanism is incorporated into the network to compute the channel weights of the extracted features and obtain the key channel features. Combining the spatial and channel domain attention mechanisms applies mixed spatial and channel attention to the features. Second, a network layer that combines dilated convolution with the channel domain attention mechanism is integrated into the output layer of the feature extractor, so the network can expand the receptive field of the extracted features without losing spatial resolution. The channel weights of the features are determined for each receptive field and then fused through different schemes to obtain the channel weights of the global receptive field and identify the key channel features. In addition, a feature pyramid network structure is used to construct the feature extractor, which allows the network to extract high-resolution feature maps and considerably improves its detection performance. Lastly, the network architecture is based on a two-stage region proposal network, which regresses accurate 3D bounding boxes. Result A series of experiments was conducted on the KITTI (a project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset using the proposed method.
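The channel domain attention step in the Method (computing a weight per channel and using it to emphasize key channels) can be sketched as follows. The abstract does not specify the form of the attention module, so this is only an assumption: a squeeze-and-excitation-style gate with global average pooling, a single linear layer, and a sigmoid. The function names and the `weights`/`biases` parameters are hypothetical; in a trained network they would be learned.

```python
import math


def global_average_pool(feature_map):
    """Squeeze step: one mean value per channel; feature_map is [C][H][W]."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature_map, weights, biases):
    """Rescale every channel by a gate in (0, 1).

    weights is a hypothetical C x C matrix and biases a length-C vector
    (assumed here, not taken from the paper).
    """
    squeezed = global_average_pool(feature_map)
    gates = [
        sigmoid(sum(w * s for w, s in zip(row, squeezed)) + b)
        for row, b in zip(weights, biases)
    ]
    rescaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return rescaled, gates


# Two 2x2 channels with illustrative values.
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]

rescaled, gates = channel_attention(features, identity, [0.0, 0.0])
print(gates)
```

Channels with a larger gate keep more of their activation, which is one simple way to express "key channel features". A spatial domain counterpart would apply a weight per location instead of per channel.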
Cases in which the object is slightly to severely occluded are denoted as “easy”, “moderate”, and “hard” in the tables. For the car class in the test set, the AP3D values (average precision of 3D detection boxes) are 83.45%, 74.29%, and 67.92%, and the APBEV values (average precision of 2D detection boxes in the bird’s-eye view) are 89.61%, 87.05%, and 79.69%. For the pedestrian class, the AP3D values are 52.23%, 44.91%, and 41.64%, and the APBEV values are 59.73%, 53.97%, and 49.62%. For the cyclist class, the AP3D values are 65.02%, 54.38%, and 47.97%, and the APBEV values are 69.13%, 59.69%, and 52.11%. We also performed ablation experiments on the test set. For the car class, relative to the full method, the average AP3D decreases by approximately 6.09% when the pyramid structure is removed, by approximately 0.99% when the mixed domain attention structure is removed, and by approximately 0.71% when the dilated convolution structure is removed. Conclusion We propose a two-stage 3D object detection network that combines dilated convolution and mixed domain attention. The experimental results show that the proposed method outperforms several existing state-of-the-art 3D object detection methods, obtains accurate detection results, and can be effectively applied to outdoor autonomous driving.
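As a reading aid for the reported numbers, the short script below averages each class's scores over the three difficulty levels. It assumes that the "average value" of AP3D in the ablation results means the arithmetic mean over easy, moderate, and hard; the abstract does not state this explicitly, nor whether the reported reductions are percentage points or relative changes. The dictionary and function names are hypothetical.

```python
# Scores copied from the abstract, ordered (easy, moderate, hard), in percent.
AP_3D = {
    "car": (83.45, 74.29, 67.92),
    "pedestrian": (52.23, 44.91, 41.64),
    "cyclist": (65.02, 54.38, 47.97),
}
AP_BEV = {
    "car": (89.61, 87.05, 79.69),
    "pedestrian": (59.73, 53.97, 49.62),
    "cyclist": (69.13, 59.69, 52.11),
}


def mean_ap(scores):
    """Arithmetic mean over difficulty levels (an assumed definition)."""
    return sum(scores) / len(scores)


for metric_name, table in (("AP3D", AP_3D), ("APBEV", AP_BEV)):
    for class_name, scores in table.items():
        print(f"{metric_name} {class_name}: {mean_ap(scores):.2f}")
```

Under this assumption, the mean AP3D for the car class is about 75.22%, which is the baseline the ablation reductions would be measured against.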
Keywords
