  • Published: 2024-09-04
Deep 3D Model Classification by Fusing Multi-view Consistent and Complementary Information

Wu Han, Hu Liangchen, Yang Ying, Jie Biao, Luo Yonglong (Anhui Normal University)

Abstract
Objective Deep learning-based methods have achieved state-of-the-art performance on 3D model classification. Such methods extract features from different data representations of a 3D model, for example using a deep network to extract multi-view features and combine them into a single, compact shape descriptor. However, these methods consider only the information that is consistent across views and ignore the differences between views. To address this problem, this paper proposes a new feature network that learns both the consistent and the complementary information in the multi-view representation of a 3D model and fuses them effectively, so as to make full use of the multi-view features and improve 3D model classification accuracy. Method The method introduces dilated convolution into the residual blocks of a residual network to enlarge the receptive field of the convolution operation, and the network structure is then adjusted for multi-view feature extraction. A dedicated view classification network then separates consistent information from complementary information so that every view is fully exploited. To handle these two kinds of information, a learnable fusion strategy combined with an attention mechanism fuses the two classes of feature views into a shape-level descriptor, enabling reliable 3D model classification. Result The effectiveness of the model is validated on two subsets of the ModelNet dataset. On ModelNet40 it achieves the best performance among all compared methods. In a single-classification task designed to compare different feature extraction networks, the proposed method performs best in both classification accuracy and average loss. Compared with the baseline multi-view convolutional neural network (MVCNN), the proposed method improves performance by up to 3.6% under different numbers of views and improves overall classification accuracy by 5.43%. Conclusion This paper proposes a deep 3D model classification network based on multi-view information fusion that deeply fuses the consistent and complementary information of multiple views and obtains clear gains on the 3D model classification task. The experimental results also show that the proposed method has certain advantages over existing related methods.
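As a concrete illustration of the dilated-convolution residual structure described above, the following is a minimal PyTorch sketch assuming a standard ResNet-style basic block; the module name DilatedResidualBlock, the dilation rates (1, 2, 4), and the channel counts are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: a residual block augmented with multi-scale dilated convolution
# after the ordinary convolution. Names, dilation rates and channel counts are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class DilatedResidualBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # Ordinary 3x3 convolution, as in a standard ResNet basic block.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Parallel dilated 3x3 convolutions enlarge the receptive field at several
        # scales; padding=d keeps the spatial size unchanged for kernel size 3.
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d, bias=False)
            for d in dilations
        )
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        # Sum the multi-scale dilated responses, then add the identity shortcut.
        out = self.bn(sum(branch(out) for branch in self.dilated))
        return self.relu(out + x)


if __name__ == "__main__":
    views = torch.randn(12, 64, 56, 56)   # e.g., 12 rendered views of one 3D model
    block = DilatedResidualBlock(channels=64)
    print(block(views).shape)             # torch.Size([12, 64, 56, 56])
```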
Keywords
Multi-view Consistent and Complementary Information Fusion Method for 3D Model Classification

Wu Han, Hu Liangchen, Yang Ying, Jie Biao, Luo Yonglong (Anhui Normal University)

Abstract
Objective 3D model classification holds significant promise across diverse applications, including autonomous driving, game design, and 3D printing. With the rapid advancement of deep learning, numerous deep neural networks have been investigated for 3D model classification. Among these approaches, view-based methods consistently outperform voxel-, mesh-, and 3D point cloud-based methods. The view-based method captures multiple 2D views from various angles of a 3D object to represent its 3D information. This approach closely aligns with human visual processing, transforming 3D problems into manageable 2D tasks solvable with standard convolutional neural networks (CNNs). In contrast, voxel-based and point cloud-based methods primarily focus on the spatial characteristics of 3D models and necessitate the generation of substantial datasets. The CNNs used are typically established models such as the Visual Geometry Group Network (VGG), the Inception Network (GoogLeNet), and the Residual Network (ResNet), which derive a view representation of the 3D model. Methods such as the Multi-view Convolutional Neural Network (MVCNN) and the Group-view Convolutional Neural Network (GVCNN) leverage pre-trained network weights to obtain view descriptors for multiple perspectives. However, these approaches often neglect the complementary information between views, an aspect crucial for shaping the final descriptor. Since shape descriptors ultimately drive the recognition task, obtaining 3D model shape descriptors from view descriptors remains a fundamental challenge for achieving optimal 3D model classification. Recent studies, including MVCNN and the Dynamic Routing Convolutional Neural Network (DRCNN), employ a view pooling scheme to generate discriminative descriptors from the feature representations of multiple views, marking significant milestones in 3D model classification with notable performance improvements. It is noteworthy, however, that existing methods inadequately exploit the view characteristics among multiple perspectives, which severely limits the efficacy of the shape descriptors. On one hand, the inherent differences between the 2D views projected from a 3D object constitute complementary information that enhances the generation of the final shape descriptor. On the other hand, each 2D view can to some extent represent its corresponding 3D object, signifying consistent features between views; integrating these consistent features improves the accuracy of the recognition task. Consequently, learning the complementary information between views and integrating it with the consistent information emerges as a critical step for advancing 3D model classification. Method To address this challenge, this paper introduces a network model that fuses the complementary and consistent information gleaned from multiple views, thereby increasing the comprehensiveness of the information. Specifically, the model aims to fuse the association information between views for 3D object classification. Initially, an enhanced residual network is employed to extract feature representations from multiple views, yielding view descriptors. Subsequently, a pre-classification network, coupled with an attention mechanism and a weight learning strategy, is utilized to fuse these view descriptors and generate shape descriptors.
To enhance the residual structure of the ResNet model, we introduce multi-scale dilated convolution after the ordinary convolution within the residual module during the one-way propagation of a single view through the network. This augmentation extends the receptive field of the convolution operation and facilitates the extraction of complementary information. Additionally, a pre-classification module is proposed to gauge the recognition degree of each view based on shape characteristics. Using this information, the views are categorized into complementary views and consistent views. A subset of each type of view is fused into a feature view, so that each of the two feature views carries one of the two characteristics. These feature views are fed into an attention network to reinforce consistency and complementarity. Subsequently, a learnable weight fusion module is applied to the two feature views, weighting and fusing them to generate the shape descriptor. Finally, we refine the overall network structure and strategically position the pre-classification layer and the attention layer to ensure optimal outcomes for the proposed method. Result This study conducted a series of experiments to validate the effectiveness of the proposed model on the ModelNet10 and ModelNet40 datasets. Initially, an ablation experiment was performed, focusing on the comparison of module insertion positions, feature extraction networks, and the number of views. The experimental results indicate that tightly coupling the pre-classification module with the attention module and inserting them into the second or third layer of the residual network yields better final classification accuracy than insertion between other layers. The method introduced in this study demonstrates higher average single-class classification accuracy than models such as ResNet50 under an equivalent number of training iterations; it also exhibits lower average losses and more robust loss convergence. We further evaluated the performance of the model across varying numbers of views: irrespective of the number of views, our method consistently outperforms or matches MVCNN and DRCNN, and as the number of views increases from 3 to 6 to 12, both our model and DRCNN show a continuous improvement in accuracy. Finally, when compared with 15 classic multi-view 3D object classification algorithms, our model achieved an average instance accuracy of 97.27% on ModelNet10 using the designed feature extraction network with the same 12 views. It nevertheless falls slightly behind DRCNN, possibly because the limited data in ModelNet10 leads to overfitting and reduced classification accuracy. On ModelNet40, our model achieved an average instance accuracy of 96.32%. The comparison with ResNet50 and VGG-M also highlights certain advantages in classification accuracy for our model, affirming that it extracts inter-view information on the backbone network more effectively than the other methods. Conclusion In this study, we present a robust deep 3D object classification method that leverages multi-view information fusion to integrate both the consistency and the complementarity information among views through network structure adjustments. The experimental findings substantiate the favorable performance of the proposed model.
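To make the fusion stage more concrete, the following is a minimal PyTorch sketch of the view pre-classification, the consistent/complementary split, the attention reinforcement, and the learnable weighted fusion, assuming 512-dimensional view descriptors; the module name ViewFusion, the confidence-based top-k split rule, and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of the fusion stage: a pre-classification head scores each view
# descriptor, views are split into consistent and complementary groups, and an
# attention layer plus a learnable weight fuse the two pooled feature views into
# one shape descriptor. Names, dimensions and the split rule are assumptions.
import torch
import torch.nn as nn


class ViewFusion(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 40, num_consistent: int = 6):
        super().__init__()
        self.pre_classifier = nn.Linear(feat_dim, num_classes)   # per-view pre-classification
        self.attention = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))              # learnable fusion weight
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.k = num_consistent

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, V, D) descriptors of V views of each 3D model.
        logits = self.pre_classifier(view_feats)                  # (B, V, C)
        confidence = logits.softmax(dim=-1).max(dim=-1).values    # how recognizable each view is
        order = confidence.argsort(dim=1, descending=True)        # most confident views first

        idx = order.unsqueeze(-1).expand_as(view_feats)
        sorted_feats = torch.gather(view_feats, 1, idx)
        consistent, complementary = sorted_feats[:, :self.k], sorted_feats[:, self.k:]

        # Self-attention reinforces consistency / complementarity within each group,
        # then mean pooling yields one feature view per group.
        consistent, _ = self.attention(consistent, consistent, consistent)
        complementary, _ = self.attention(complementary, complementary, complementary)
        fused = self.alpha * consistent.mean(dim=1) + (1 - self.alpha) * complementary.mean(dim=1)

        return self.classifier(fused)                             # shape-level class scores


if __name__ == "__main__":
    feats = torch.randn(2, 12, 512)        # 2 models, 12 views, 512-d view descriptors
    print(ViewFusion()(feats).shape)       # torch.Size([2, 40])
```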
Keywords
