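Point cloud semantic segmentation method based on multi-feature fusion and residual optimization
Abstract
Objective Current semantic segmentation methods for large-scene 3D point clouds generally cut the large-scale point cloud into blocks before processing. In practice, however, the geometric features at the cut boundaries are easily destroyed, so the segmentation results show obvious boundary artifacts. An efficient deep learning network model that takes the original point cloud as input is therefore urgently needed for point cloud semantic segmentation. Method To address this problem, a point cloud semantic segmentation method based on multi-feature fusion and residual optimization is proposed. The network uses a multi-feature extraction module to extract the geometric structure features and semantic features of each point, and obtains a feature set by weighting these features. On this basis, an attention mechanism is introduced to optimize the feature set, and a feature aggregation module is built to aggregate the most discriminative features in the point cloud. Finally, a residual block is added to the feature aggregation module to optimize network training. The final output of the network is the confidence of each point for each category in the dataset. Result The proposed residual network model is compared in segmentation accuracy with current mainstream algorithms on two datasets: S3DIS (Stanford Large-scale 3D Indoor Spaces Dataset) and the outdoor point cloud segmentation dataset Semantic3D. On S3DIS, the proposed algorithm achieves high overall accuracy and mean accuracy, 87.2% and 81.7%, respectively. On Semantic3D, it achieves high overall accuracy and mean intersection over union, 93.5% and 74.0%, which are 1.6% and 3.2% higher than GACNet (graph attention convolution network), respectively. Conclusion The experimental results verify that, in large-scale point cloud semantic segmentation, the proposed residual optimization network can alleviate gradient vanishing and network overfitting during deep feature extraction while maintaining good segmentation performance.
Keywords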
Point cloud semantic segmentation method based on multi-feature fusion and residual optimization
Du Jing, Cai Guorong (School of Computer Engineering, Jimei University, Xiamen 361000, China)
Abstract
Objective Semantic segmentation of a 3D point cloud takes the point cloud as input and outputs a semantic label for each point. However, existing semantic segmentation methods for 3D point clouds are mainly limited to processing small-scale point cloud blocks. When a large-scale point cloud is cut into small blocks, the geometric features at the cut boundaries are easily destroyed, which produces obvious boundary artifacts in the segmentation results. In addition, traditional deep networks for semantic segmentation have difficulty meeting the computational efficiency requirements of large-scale data. Therefore, an efficient deep learning network model that takes the original point cloud directly as input for point cloud semantic segmentation is urgently needed. Most networks nevertheless still train on small point cloud blocks, mainly because handling point clouds of large scenes directly is difficult for three reasons. First, the spatial size and number of points of 3D point cloud data collected by sensor scanning are not fixed, so the network must not require a specific number of input points and must be insensitive to that number. Second, the geometric structure of large scenes is more complicated than that of small-scale point cloud blocks, which increases the difficulty of segmentation. Third, processing point clouds of large scenes directly requires a large amount of computation, which poses a huge challenge to existing graphics processing unit (GPU) memory. The main obstacle our framework must overcome is therefore to process large-scale 3D point clouds directly: point clouds with different spatial structures and numbers of points should be fed directly into the network for training while time and space complexity remain acceptable. Method In this study, a residual optimization network for large-scale point cloud semantic segmentation is proposed.
First, we choose random sampling as the down-sampling strategy, whose computation time is independent of the number of points. Each encoding layer has a random sampling module. This design makes it possible to gradually increase the dimension of each point feature while gradually reducing the size of the point cloud. The input to the network is the entire large-scale 3D point cloud scene. In addition, a local feature extraction module is designed to capture the neighbor, geometric, and semantic features of each point. The final feature set is obtained by a weighted summation of the three types of features. The network introduces an attention mechanism to optimize the feature set and builds a feature aggregation module on top of it to aggregate the most discriminative features in the point cloud. Finally, a residual block is added to the feature aggregation module to optimize network training. The network framework adopts an encoder-decoder structure. Unlike the traditional encoder-decoder structure, this study adjusts the internal structure of each encoder layer for the special application scenario of large-scene point clouds, including the down-sampling ratio of the point cloud and the dimension of the output features. The output of the network is the score of each point for each category in the dataset. In summary, the network first passes the input point cloud through a multilayer perceptron (MLP) layer to extract the features of the center point. Then, five encoding and five decoding layers are used to learn the features of each point. Finally, three fully connected layers are used to predict the semantic label of each point. Result The proposed method was compared with the latest methods on two datasets: the Stanford Large-scale 3D Indoor Spaces Dataset (S3DIS) and the Semantic3D dataset.
The S3DIS dataset contains 12 semantic elements and is more fine-grained and challenging than many other indoor semantic segmentation datasets. Four metrics were evaluated: intersection over union (IoU), mean IoU (mIoU), mean accuracy (mAcc), and overall accuracy (OA). We set the k value of the k-nearest neighbor algorithm to 16. The batch size is 8 for training and 24 for evaluation. Each epoch has 500 training steps and 100 validation steps. The maximum number of epochs during testing is 150. The experiment was conducted using an NVIDIA GTX 1080 Ti GPU. On the S3DIS dataset, our algorithm achieves the best accuracy in OA and mAcc. Compared with the super point graphs (SPG) network, which also takes the entire large scene as input, the proposed algorithm improves the OA, mAcc, and mIoU by 1.7%, 8.7%, and 3.8%, respectively. The Semantic3D dataset contains eight semantic classes and covers a wide range of urban outdoor scenes: churches, streets, railroad tracks, squares, villages, football fields, castles, etc. It has about 4 billion manually labeled points and a variety of natural and artificial scenes to prevent overfitting of the classifier. We set the batch size to 4 for training and 16 for evaluation. Other settings are the same as for S3DIS. On the Semantic3D dataset, the mIoU is 3.2% higher and the OA is 1.6% higher than those of the latest algorithm, GACNet. Our network also achieved the best accuracy in multiple categories such as high vegetation, buildings, and remaining hard landscapes. These results verify the performance of the residual optimization network proposed in this study for large-scale point cloud semantic segmentation, which alleviates vanishing gradients and network degradation during feature extraction.
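The four metrics named in the Result part (IoU, mIoU, mAcc, and OA) can all be derived from a single confusion matrix. The following is a minimal NumPy sketch of the standard definitions, not the authors' evaluation code; the function names `confusion_matrix` and `segmentation_metrics` and the toy labels are hypothetical and chosen only for illustration.

```python
import numpy as np

def confusion_matrix(labels, preds, num_classes):
    """Count (ground truth, prediction) pairs; rows are ground-truth classes."""
    labels = np.asarray(labels, dtype=np.int64)
    preds = np.asarray(preds, dtype=np.int64)
    flat = labels * num_classes + preds
    counts = np.bincount(flat, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Return OA, mAcc, per-class IoU, and mIoU from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)                 # correctly labeled points per class
    gt = cm.sum(axis=1)              # ground-truth points per class
    pred = cm.sum(axis=0)            # predicted points per class
    union = gt + pred - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = np.where(union > 0, tp / union, np.nan)
        acc = np.where(gt > 0, tp / gt, np.nan)
    return {
        "OA": tp.sum() / cm.sum(),   # overall accuracy
        "mAcc": np.nanmean(acc),     # mean per-class accuracy
        "IoU": iou,                  # per-class intersection over union
        "mIoU": np.nanmean(iou),     # mean over classes that occur
    }

# Toy example with three classes and six points (made-up labels).
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
metrics = segmentation_metrics(cm)
```

Classes that appear neither in the ground truth nor in the predictions are excluded from the means here; whether a given benchmark does the same is a convention to check against its own evaluation protocol.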
Conclusion We propose a new semantic segmentation framework that introduces the residual network into the semantic segmentation of large-scale point clouds, which allows the network to be deepened and more discriminative features to be extracted. Our network shows excellent performance on multiple datasets and makes the semantic segmentation results more accurate.
Keywords
computer vision; three-dimensional point cloud; large scene; semantic segmentation; multi-feature fusion; residual network
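The Method part of the abstract combines three ideas in each encoding layer: random down-sampling, attention-weighted aggregation of neighbor features, and a residual shortcut. The following NumPy sketch illustrates one way these pieces could fit together; it is not the authors' network. It uses a single feature vector per point instead of the paper's separate neighbor, geometric, and semantic features, and the names `random_downsample`, `knn_indices`, `attentive_aggregate`, and the score matrix `w_score` are hypothetical. Only k = 16 is taken from the abstract; the sizes and the down-sampling ratio are made up.

```python
import numpy as np

def random_downsample(points, features, ratio, rng):
    """Keep a uniformly random subset containing 1/ratio of the points."""
    n_keep = max(1, points.shape[0] // ratio)
    keep = rng.choice(points.shape[0], size=n_keep, replace=False)
    return points[keep], features[keep]

def knn_indices(points, k):
    """Brute-force k nearest neighbours by squared Euclidean distance."""
    diff = points[:, None, :] - points[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)                  # (N, N)
    return np.argsort(d2, axis=1)[:, :k]           # (N, k)

def attentive_aggregate(features, neighbours, w_score):
    """Softmax-weighted sum of neighbour features plus a residual shortcut."""
    neigh = features[neighbours]                   # (N, k, C)
    scores = neigh @ w_score                       # (N, k, C) attention scores
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)  # softmax over the k neighbours
    pooled = (attn * neigh).sum(axis=1)            # (N, C)
    return pooled + features                       # residual connection

# Toy run on random data (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
pts = rng.random((1024, 3))
feats = rng.random((1024, 8))
w_score = rng.standard_normal((8, 8)) * 0.1
pts, feats = random_downsample(pts, feats, ratio=4, rng=rng)
neighbours = knn_indices(pts, k=16)
out = attentive_aggregate(feats, neighbours, w_score)
print(out.shape)  # (256, 8)
```

The brute-force neighbor search is quadratic in the number of points and is only suitable for a toy example; a spatial index would be needed at the scale of a full scene.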