Self-attention fusion and modulation for weakly supervised semantic segmentation

Shi Deshuo, Li Junxia, Liu Qingshan (Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China)

Abstract
Objective Most existing weakly supervised segmentation methods with image-level annotations use convolutional neural networks to obtain pseudo labels, and the target regions these labels cover are often too small. Transformer-based methods usually expand the class activation maps with self-attention; however, affected by the inaccuracy of deep-layer attention, the refined pseudo labels contain considerable background noise. To exploit the advantages of both types of feature extraction networks while combining the attention characteristics of different Transformer layers, this paper constructs a self-attention fusion and modulation network that combines convolutional features and Transformer features for weakly supervised semantic segmentation.

Method The convolution-enhanced Transformer (Conformer) is adopted as the feature extraction network; it encodes the image more comprehensively and produces the initial class activation maps. A layer-adaptive self-attention fusion module is designed, which generates fusion weights from the self-attention values and the importance of each layer; the fused self-attention suppresses background noise well. A self-attention modulation module is further proposed, in which a modulation function built on the attention relations between pixel pairs increases the activation response of foreground pixels. The modulated attention is then used to refine the initial class activation maps so that they cover more of the target regions while background noise is effectively suppressed.

Result Segmentation networks are trained with the obtained pseudo labels on the most widely used PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes 2012) and COCO 2014 (common objects in context 2014) datasets, and the proposed algorithm achieves the best results in all comparison experiments. On the PASCAL VOC validation set, the mean intersection over union (mIoU) reaches 70.2%, and the mIoU on the test set is 70.5%. Compared with the best Transformer model among the competitors, performance improves by 0.9% on both the validation and test sets; compared with the best convolutional neural network method, mIoU improves by 0.7% on the validation set and 0.8% on the test set. On the COCO 2014 validation set the result is 40.1%, a 0.5% gain in segmentation accuracy over the best competing method.

Conclusion The proposed weakly supervised semantic segmentation model combines the advantages of convolutional neural networks and Transformers; by adaptively fusing and modulating the Transformer self-attention, it achieves the current best semantic segmentation results under image-level labels, and it can be applied to fields such as 3D reconstruction and robot scene understanding. Moreover, both the self-attention adaptive fusion module and the self-attention modulation module can be embedded into Transformer architectures to obtain more robust and more discriminative features for specific vision tasks.
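As a rough illustration of the layer-adaptive fusion described above, the PyTorch sketch below weights each layer's head-averaged attention map by its depth and by the dispersion of its activation values before fusing. The depth-decay formula, the use of the standard deviation as the dispersion measure, and the name fuse_attention are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def fuse_attention(attn_maps):
    """Fuse per-layer self-attention maps with adaptive weights (sketch).

    attn_maps: list of head-averaged (N, N) attention tensors, one per
    Transformer layer, where N is the number of patch tokens.
    """
    num_layers = len(attn_maps)
    raw_weights = []
    for idx, attn in enumerate(attn_maps):
        layer_w = (num_layers - idx) / num_layers  # shallow layers trusted more (assumption)
        disp_w = attn.std()                        # dispersion as importance proxy (assumption)
        raw_weights.append(layer_w * disp_w)
    weights = torch.softmax(torch.stack(raw_weights), dim=0)
    fused = sum(w * attn for w, attn in zip(weights, attn_maps))
    return fused / fused.sum(dim=-1, keepdim=True)  # row-normalized pixel-pair affinities
```

For a 12-layer Transformer branch, attn_maps would hold the per-layer attention averaged over heads; the fused map then replaces a plain sum of layers, which the paper argues is suboptimal because deep layers carry background noise.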
Keywords
Self-attention fusion and modulation for weakly supervised semantic segmentation

Shi Deshuo, Li Junxia, Liu Qingshan (Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China)

Abstract
Objective Semantic segmentation is a fundamental task in computer vision and image processing whose aim is to assign a class label to each pixel. However, training a segmentation model usually relies on dense pixel-wise annotations, which are time consuming and labor intensive to collect. To eliminate the dependence on pixel-level labels, weakly supervised semantic segmentation (WSSS) has been widely studied because it relies only on weak and cheap supervision such as points, scribbles, image-level labels, and bounding boxes. Among these, image-level labels are the weakest and easiest to obtain, while generating high-quality pseudo labels from them is the most difficult step. The main challenge of WSSS based on a convolutional neural network with image-level supervision lies in the inherent gap between the classification and segmentation tasks, which shrinks the activated target regions and thus fails to satisfy the requirements of segmentation. A Transformer-based classifier, despite activating most of the foreground objects, introduces much background noise, thus decreasing the quality of the pseudo masks. To make full use of the advantages of these two types of feature extraction networks and to combine the attention characteristics of different Transformer layers, this paper constructs a self-attention fusion and modulation network for weakly supervised semantic segmentation.

Method To exploit both the local features extracted by the convolutional neural network and the global features extracted by the Transformer, this paper adopts the convolution-enhanced Transformer (Conformer) as the feature extraction network, which encodes the image comprehensively and yields the initial class activation maps. The attention maps learned by the Transformer branch differ between the shallow and deep layers. Influenced by the convolutional information, the attention maps in shallow layers tend to capture detailed information of the target regions, while those in deeper layers tend to mine global information. Meanwhile, the deep-layer attention maps introduce noise in the background regions owing to incorrect relations between background and foreground, so directly summing the attention maps of all layers is suboptimal. We therefore propose a self-attention adaptive fusion module that assigns a weight to each layer to balance its importance. On the one hand, we argue that the attention maps in shallow layers are more accurate than those in deeper layers, so large weights are assigned to shallow-layer maps and small weights to deep-layer maps to reduce the noise caused by the deep layers. On the other hand, we consider the dispersion of the activation values of each attention map: a map with more dispersed activation values is given greater importance. The fused self-attention effectively suppresses background noise and describes the similarity between pixel pairs. To further increase the activation response of foreground pixels, we design a self-attention modulation module. We first normalize the attention map and then map it through an exponential function to measure the importance of each pixel pair. Given that pixels of the same target object are similar and their pairwise attention values tend to be larger than others, we strengthen such connections with a large modulation parameter. When a pixel pair has a small attention value, the two pixels are unlikely to be closely related, and keeping this connection would introduce noise, so we weaken it with a small modulation parameter. After modulation, the distance between foreground and background pixels becomes larger, and the attention maps focus more on the foreground regions than on the background ones.

Result Our experimental results demonstrate that the model achieves state-of-the-art performance: 70.2% mean intersection over union (mIoU) on the validation set and 70.5% mIoU on the test set of the most popular PASCAL VOC 2012 dataset, and 40.1% mIoU on the COCO 2014 validation set. We do not use saliency maps to provide background cues, yet our results are comparable to those of works that do. Our model outperforms the state-of-the-art multi-class token Transformer (MCTformer), which also uses a Transformer structure for feature extraction, by 2% and 2.1% mIoU on the validation and test sets, respectively. Compared with TransCAM, which directly uses attention to adjust the class activation maps, our model obtains a 0.9% gain on both the validation and test sets, demonstrating that it effectively reduces noise in background regions. Our model also outperforms IRNet, SEAM, AMR, SIPE, and URN, which use convolutional neural networks as their backbones, by 6.7%, 5.7%, 1.4%, 1.4%, and 0.7% on the validation set, respectively, confirming that our dual-branch feature extraction structure is effective and feasible. Because we extract features from both local and global perspectives, we also conduct an ablation experiment to show the importance of their complementarity: using only the convolution branch yields a class activation map (CAM) with 27.7% mIoU, whereas fusing it with the global features generated by the Transformer branch raises the mIoU to 35.1%, indicating that both local and global information help generate CAMs.

Conclusion The self-attention adaptive fusion and modulation network proposed in this paper is effective for image-level weakly supervised semantic segmentation tasks.
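A minimal sketch of how the modulation and CAM refinement described above might look, assuming a simple form for the modulation function: after row normalization, affinities above a per-row threshold are amplified with a large parameter and the rest damped with a small one before the exponential mapping. The names modulate_attention and refine_cam, the per-row mean threshold, and the parameter values high=1.5 and low=0.5 are hypothetical, not the paper's exact settings.

```python
import torch

def modulate_attention(attn, high=1.5, low=0.5):
    """Amplify strong pixel-pair affinities, damp weak ones (sketch).

    attn: (N, N) fused self-attention over N patch tokens.
    """
    attn = attn / attn.sum(dim=-1, keepdim=True)      # normalize each row
    tau = attn.mean(dim=-1, keepdim=True)             # per-row split point (assumption)
    scale = torch.where(attn >= tau,
                        torch.full_like(attn, high),  # likely same-object pair: strengthen
                        torch.full_like(attn, low))   # likely noisy pair: weaken
    mod = torch.exp(scale * attn)                     # exponential importance mapping
    return mod / mod.sum(dim=-1, keepdim=True)

def refine_cam(cam, attn_mod):
    """Refine initial CAMs by propagating them through modulated affinities.

    cam: (C, N) class activation maps flattened over the N patch tokens.
    """
    return cam @ attn_mod.t()                         # each token gathers class evidence
```

The propagation step in refine_cam (a matrix product of CAMs with the affinity matrix) is one standard way to let attention expand activations over whole objects; it is inferred from the abstract's description, not confirmed by it.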
Keywords
