Weakly supervised semantic segmentation based on dynamic mask generation

Chen Chen1,2, Tang Sheng1, Li Jintao1 (1. Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract
Objective The pixel-level annotations required by conventional image semantic segmentation are difficult to obtain in large quantities, so weakly supervised learning for image semantic segmentation is an important current research direction. Weakly supervised learning refers to completing supervised learning with weakly labeled samples; weak labels are faster and simpler to produce than pixel-level annotations and include points, bounding boxes, and scribbles. Method To address the insufficient use of multilayer features in existing methods, a weakly supervised semantic segmentation method based on dynamic mask generation is proposed. The method takes a bounding box as the initial foreground contour, iteratively extracts the edge information of the foreground object from the multilayer features of a convolutional neural network (CNN), and generates a mask from this edge information. During the iterations, high-level features are first used to estimate the approximate shape and position of the foreground object, yielding a rough segmentation mask; the rough mask is then updated layer by layer with CNN features. Result The method achieves a segmentation accuracy of 78.06% on the Pascal VOC (visual object classes) 2012 dataset, improving on the box-supervised (BoxSup), weakly and semi-supervised (WSSL), simple does it (SDI), and cut and paste (CaP) methods by 14.71%, 4.04%, 3.10%, and 0.92%, respectively. Conclusion The method exploits high-level semantic features to reduce semantic-level errors in the segmentation mask, while updating the mask with low-level features improves the accuracy of the segmentation edges.
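The following is a minimal sketch of the iterative mask generation described above. It assumes the CNN feature maps have already been upsampled to the image resolution and uses separate foreground/background Gaussian mixtures (a GrabCut-style choice; the abstract does not specify the exact probability-to-contour rule). The names refine_mask, feature_maps, and box_mask are illustrative, not taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def refine_mask(feature_maps, box_mask, n_components=2):
    # feature_maps: list of (H, W, C) arrays ordered from high-level to low-level,
    # assumed here to be already upsampled to the image resolution (sketch assumption).
    # box_mask: (H, W) binary array rasterized from the bounding-box label.
    box = box_mask.astype(bool)
    mask = box.copy()                                 # initial contour = bounding box
    for fmap in feature_maps:                         # coarse shape first, finer details later
        h, w, c = fmap.shape
        feats = fmap.reshape(-1, c)
        feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-6)  # normalize feature vectors
        # Fit foreground/background mixtures on samples selected by the mask
        # obtained in the previous iteration.
        fg = GaussianMixture(n_components).fit(feats[mask.ravel()])
        bg = GaussianMixture(n_components).fit(feats[~mask.ravel()])
        # Score every pixel under both models, keep the more likely label, and
        # restrict the result to the bounding box so the contour stays inside it.
        fg_ll = fg.score_samples(feats).reshape(h, w)
        bg_ll = bg.score_samples(feats).reshape(h, w)
        mask = (fg_ll > bg_ll) & box
    return mask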
Keywords
Weakly supervised semantic segmentation based on dynamic mask generation

Chen Chen1,2, Tang Sheng1, Li Jintao1(1.Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;2.University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract
Objective Image semantic segmentation is an important research topic in the field of computer vision. It refers to dividing an input image into multiple regions with semantic meaning, i.e., assigning a semantic category to each pixel in the image. Many studies on image semantic segmentation based on deep learning have been conducted recently in China and overseas. Current mainstream methods are based on supervised deep learning. However, deep learning requires a large number of training samples, and the image semantic segmentation problem requires a category label for every pixel in each training sample. On the one hand, pixel-level labeling is difficult; on the other hand, labeling a large number of samples entails high manual annotation costs. Therefore, image semantic segmentation based on weak supervision has become a research focus in recent years. Weakly supervised learning uses weak labels that are faster and easier to obtain, such as points, bounding boxes, and scribbles, for training. The major difficulty in weakly supervised learning is that weakly labeled data do not contain the location and contour information required for training. Method To solve the problem of missing edge information in weak labels for semantic segmentation, our primary objective is to fully utilize the multilayer features extracted by a convolutional neural network (CNN). Our contributions include the following. First, a dynamic mask generation method for extracting the edges of image foreground targets is proposed. The method uses a bounding box as the initial foreground contour and iteratively refines it using the multilayer features of a CNN together with a Gaussian mixture model. The input data of the dynamic mask generation method are the bounding-box labels and the CNN feature maps. During each iteration, the feature vectors of a specific feature map are normalized and used to initialize the Gaussian mixture model, whose training samples are selected according to the edges generated in the previous iteration. The probabilities of all sample points under the Gaussian mixture model are then calculated, and a refined contour is generated on the basis of these probabilities. In our dynamic mask generation process, the final iteration uses the original image features to improve edge accuracy, while high-level features are used for mask initialization to reduce semantic-level errors in the edge information. Second, a weakly supervised semantic segmentation method based on dynamic mask generation is proposed. The dynamically generated mask is used as supervision information to guide the CNN during semantic segmentation training. In each training step, the mask is dynamically generated from the forward-propagation result of each input image and used in place of traditional pixel-level annotation to compute the loss function (a sketch of this training step follows the abstract). The semantic segmentation model is trained in an end-to-end manner. The dynamic mask is generated only during training; testing requires only the forward propagation of the CNN. Result The segmentation accuracy of our method on the Pascal visual object classes (VOC) 2012 dataset is 78.06%. Compared with existing weakly supervised semantic segmentation methods, namely the box-supervised (BoxSup), weakly and semi-supervised learning (WSSL), simple does it (SDI), and cut and paste (CaP) methods, accuracy increases by 14.71%, 4.04%, 3.10%, and 0.92%, respectively.
On the Berkeley deep drive (BDD 100K) dataset, the segmentation accuracy of our method is 61.56%. Compared with BoxSup, WSSL, SDI, and CaP, accuracy increases by 10.39%, 3.12%, 1.35%, and 2.04%, respectively. The method improves segmentation accuracy for the pedestrian, car, and traffic light categories, and gains are also achieved for the truck and bus categories. The foreground targets of these two categories are typically large, and simple features tend to produce unsatisfactory segmentation for them; after fusing low-level, middle-level, and high-level features in this study, the segmentation accuracy for such large targets is significantly improved. Conclusion High-level features are used to estimate the approximate shape and position of the foreground object and to generate rough edges, which are then corrected layer by layer using multilayer features. High-level semantic features reduce semantic-level errors in the edge information, and low-level image features improve edge accuracy. Training is relatively slow because a mask is dynamically generated in every training step; however, inference speed is unaffected because only the forward propagation of the CNN is required.
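Below is a minimal sketch of the training step described in the Method, assuming a PyTorch segmentation network that outputs per-pixel class scores, one bounding box per image for brevity, and a hypothetical helper get_features(i) that returns the multilayer CNN features of image i; refine_mask refers to the sketch given earlier. All names and shapes are illustrative, not the authors' code.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, box_masks, box_labels, get_features):
    # images: (N, 3, H, W) tensor; box_masks[i]: (H, W) binary numpy array from the
    # bounding-box label; box_labels[i]: integer class index of that box (one box per
    # image is assumed for brevity); get_features(i): hypothetical helper returning
    # the multilayer CNN features of image i.
    logits = model(images)                            # per-pixel class scores, (N, K, H, W)
    with torch.no_grad():
        targets = []
        for i in range(images.size(0)):
            fg = refine_mask(get_features(i), box_masks[i])        # dynamic mask for this step
            target = torch.from_numpy(fg.astype('int64')) * box_labels[i]  # 0 = background
            targets.append(target)
        targets = torch.stack(targets).to(logits.device)
    loss = F.cross_entropy(logits, targets)           # mask replaces pixel-level annotation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()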
Keywords
