Hand-object pose estimation method integrating feature enhancement and complementation
GU Siyuan, GAO Shu (Wuhan University of Technology) Abstract
Objective Joint hand-object pose estimation from a single RGB image is highly challenging because severe occlusion frequently occurs when the hand interacts with an object. Moreover, existing hand-object feature extraction networks typically use a feature pyramid network (FPN) to fuse multi-scale features, yet FPN-based methods suffer from channel information loss. To address these problems, we propose a hand-object feature enhancement complementary (HOFEC) model. Method 1) To counter channel information loss, we design a channel attention-guided feature pyramid network (CAG-FPN) that introduces a channel attention mechanism into the FPN, allowing the model to better capture the relationships and importance of different channels in the input data during multi-scale feature fusion; combined with a locally shared dual-stream ResNet-50 (50-layer residual network) backbone (sketched below), it forms the hand-object feature extraction network and strengthens the model's feature extraction ability. 2) To handle mutual occlusion during hand-object interaction, we design a spatial attention module that enhances hand and object features separately while extracting information about the occluded hand-object regions, and we further design a cross-attention module for hand-object feature complementation, so that the occlusion information of the hand and object regions is fully exploited to achieve feature enhancement and complementation. 3) A hand decoder and an object decoder recover the hand pose and the object pose, respectively. Result Compared with SOTA models on the HO3D and Dex-ycb datasets, our method achieves competitive results on both hand pose estimation and object pose estimation. On the HO3D dataset, compared with ten recent models, the hand pose estimation metrics PAMPJPE and PAMPVPE both improve by 0.1 mm over the second-best HandOccNet, and the object pose estimation metric ADD-0.1D improves by 2.1% over the second-best HFL-Net. On the Dex-ycb dataset, compared with seven recent models, the hand pose estimation metrics MPJPE and PAMPJPE improve by 0.2 mm and 0.1 mm, respectively, over the second-best HFL-Net, and the object pose estimation metric ADD-0.1D improves by 6.4% over the second-best HFL-Net. Conclusion The proposed HOFEC model can simultaneously and accurately estimate hand and object poses in hand-object interaction scenarios (code: https://github.com/rookiiiie/HOFEC).
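The locally shared dual-stream backbone mentioned above can be pictured with the following minimal PyTorch sketch. Which ResNet-50 stages are shared between the hand stream and the object stream is our assumption (the abstract only states that the two streams are locally shared), and all names here are illustrative.

import torch
from torchvision.models import resnet50

def make_streams():
    # Two ResNet-50 streams, one per stream (hand / object).
    hand_net = resnet50(weights=None)
    obj_net = resnet50(weights=None)
    # Locally share the middle stages (assumption): both streams reuse the
    # same layer2/layer3 modules, while the stem, layer1, and layer4 stay
    # stream-specific.
    obj_net.layer2 = hand_net.layer2
    obj_net.layer3 = hand_net.layer3
    return hand_net, obj_net

def extract_pyramid(net, x):
    # Return the multi-scale features {C2, C3, C4, C5} consumed by the FPN.
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))
    c2 = net.layer1(x)   # 1/4 resolution
    c3 = net.layer2(c2)  # 1/8
    c4 = net.layer3(c3)  # 1/16
    c5 = net.layer4(c4)  # 1/32
    return [c2, c3, c4, c5]

hand_net, obj_net = make_streams()
img = torch.randn(1, 3, 256, 256)
hand_feats, obj_feats = extract_pyramid(hand_net, img), extract_pyramid(obj_net, img)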
Keywords
hand-object pose estimation; feature extraction network; feature enhancement; feature complementation; attention mechanism
Hand-object pose estimation method integrating feature enhancement and complementation
GU Siyuan, GAO Shu (School of Computer Science and Artificial Intelligence, Wuhan University of Technology) Abstract
Objective Estimating joint hand-object poses from a single RGB image is highly challenging, primarily because severe occlusions frequently occur during hand-object interactions and obscure critical features. Hand-object interaction scenes are also highly dynamic and complex, and traditional computer vision techniques struggle to handle them under heavy occlusion. Furthermore, existing hand-object feature extraction networks commonly use a feature pyramid network (FPN) to fuse multi-scale features and capture information across levels, but FPN-based methods lose channel information during feature extraction, which directly degrades the accuracy of the final pose estimates. To address these challenges, we propose a novel hand-object feature enhancement complementary (HOFEC) model that optimizes feature extraction and fusion, thereby improving pose estimation under complex backgrounds and occlusion. Method 1) To counter channel information loss in feature extraction, we propose a channel attention-guided feature pyramid network (CAG-FPN), which integrates a channel attention mechanism into the traditional FPN framework. The channel attention mechanism dynamically re-weights feature channels according to their relevance to the task, enabling the network to discern and prioritize the relationships and importance among channels in the input data during multi-scale feature fusion and to exploit feature information crucial for accurate recognition. We combine CAG-FPN with a dual-stream ResNet-50 network built around local sharing to jointly construct the hand-object feature extraction network, which markedly improves the model's capacity to capture and represent hand and object features, particularly in complex scenes with high variability and occlusion.
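To make the CAG-FPN idea concrete, the following is a minimal sketch of one top-down fusion step, assuming a squeeze-and-excitation style channel gate on the lateral connection; the exact attention design, and where the paper applies its shuffle operation, are our assumptions rather than the published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGate(nn.Module):
    # Squeeze-and-excitation style channel attention (assumed design).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # global average pool -> channel weights
        return x * w[:, :, None, None]    # re-weight each channel

class CAGFusion(nn.Module):
    # One top-down FPN fusion step: gate the lateral feature by channel
    # attention before adding the upsampled coarser feature.
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, 1)
        self.gate = ChannelGate(out_channels)

    def forward(self, c_fine, p_coarse):
        lat = self.gate(self.lateral(c_fine))  # emphasize informative channels
        up = F.interpolate(p_coarse, size=lat.shape[-2:], mode="nearest")
        return lat + up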
2) To tackle the mutual occlusion that arises during hand-object interaction, we develop a spatial attention module that simultaneously enhances the features of the hand and the object while extracting information about the occluded regions. The spatial attention mechanism lets the model selectively focus on significant areas and, for each stream, yields secondary features describing the regions it suppresses, which are essential for handling occlusion. We further design a cross-attention module that injects the secondary features of the hand into the primary features of the object and vice versa, establishing complementarity between the two streams. The module integrates occlusion information from both the hand and object regions, and its correlation matrix filters out irrelevant background noise, so feature enhancement and mutual complementation are performed precisely and thoroughly. This significantly improves pose estimation accuracy in complex, dynamic hand-object interactions.
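The enhancement-plus-complementation scheme can be sketched as follows: a learned spatial attention map splits each stream into primary (attended) and secondary (suppressed) features, and a cross-attention layer, whose softmaxed correlation matrix down-weights background positions, injects one stream's secondary features into the other's primary features. Shapes and module names are illustrative assumptions, not the paper's code.

import torch
import torch.nn as nn

class SpatialSplit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        a = self.attn(x)            # (B, 1, H, W) spatial attention map
        return x * a, x * (1 - a)   # primary, secondary features

class CrossComplement(nn.Module):
    # Queries come from one stream's primary features; keys/values from the
    # other stream's secondary features. The softmaxed query-key correlation
    # matrix suppresses irrelevant background positions.
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, primary, secondary):
        B, C, H, W = primary.shape
        q = primary.flatten(2).transpose(1, 2)     # (B, HW, C) queries
        kv = secondary.flatten(2).transpose(1, 2)  # (B, HW, C) keys/values
        out, _ = self.attn(q, kv, kv)
        return primary + out.transpose(1, 2).reshape(B, C, H, W)

split_h, split_o = SpatialSplit(256), SpatialSplit(256)
cross = CrossComplement(256)
f_h, f_o = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
(p_h, s_h), (p_o, s_o) = split_h(f_h), split_o(f_o)
f_h_new = cross(p_h, s_o)   # the object's secondary features complement the hand
f_o_new = cross(p_o, s_h)   # the hand's secondary features complement the object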
3) Separate hand and object decoders recover the poses of the hand and the object independently. Both decoders account for hand-object interaction effects during information fusion, ensuring accurate and consistent pose outputs and enabling reliable pose estimation in complex hand-object interaction scenarios. Result Compared with state-of-the-art (SOTA) models, the proposed method achieves competitive performance on both hand pose estimation and object pose estimation on the HO3D and Dex-ycb datasets. On HO3D, against ten recent models, the hand pose estimation metrics PAMPJPE and PAMPVPE both improve by 0.1 mm over the second-best HandOccNet, and the object pose estimation metric ADD-0.1D surpasses the second-best HFL-Net by 2.1%. On Dex-ycb, against seven recent models, the hand pose estimation metrics MPJPE and PAMPJPE improve by 0.2 mm and 0.1 mm over the second-best HFL-Net, respectively, and ADD-0.1D improves by 6.4% over HFL-Net. Conclusion The proposed HOFEC model improves the accuracy of hand-object pose estimation in interactive scenarios by enabling complementary information exchange between the hand and the object. Introducing a channel attention mechanism together with shuffle operations not only addresses the channel information loss of FPN but also further supplements and strengthens features at different scales. A spatial-attention-based enhancement module strengthens hand and object features at the spatial scale while extracting the secondary features of both; a cross-attention mechanism then uses these secondary features to complement the primary features of the other stream, filtering out the irrelevant background information they carry. This design resolves the underutilization of occlusion information and thereby improves hand-object pose estimation accuracy. On this basis, a hand-object decoder decodes the hand and the object separately and reconstructs their complete poses. Experimental results show that even under severe occlusion during hand-object interaction, the proposed HOFEC model still estimates hand and object poses accurately.
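For reference, the metrics quoted above are standard in the hand-object pose estimation literature; a compact NumPy rendering of MPJPE, Procrustes-aligned PAMPJPE, and the ADD-0.1D success rate reads:

import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error; pred/gt: (J, 3) arrays in mm.
    return np.linalg.norm(pred - gt, axis=1).mean()

def pa_mpjpe(pred, gt):
    # MPJPE after similarity (Procrustes) alignment of pred onto gt.
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # fix an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + mu_g, gt)

def add_01d(avg_errors, diameters):
    # ADD-0.1D: fraction of samples whose average model-point distance
    # is below 10% of the object's diameter.
    return np.mean(np.asarray(avg_errors) < 0.1 * np.asarray(diameters))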
Keywords
hand-object pose estimation; feature extraction network; feature enhancement; feature complementation; attention mechanism