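Unsupervised foreground segmentation fusing semantic and apparent features
Abstract
Objective Foreground segmentation is an important task in image understanding. Under unsupervised conditions, different images and instances vary widely in appearance, so methods based on fixed rules or a single type of feature can hardly guarantee stable segmentation performance. To address this problem, this paper proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF). Method Semantic features respond accurately to the key regions of foreground objects, but the resulting segmentation tends to cover only those key regions and lacks a complete representation of the object. Apparent features, typified by saliency and edges, provide richer detail, but rules based on appearance cannot cope with the variety of instances and imaging modes. To combine the strengths of the two, we establish an encoding of unary region features and binary context features that fuses semantic and apparent information, giving a comprehensive description of both kinds of feature. We then design an intra-image adaptive parameter learning method that computes the most suitable feature weights and generates a foreground confidence score map. Further, a segmentation network is used to learn the foreground features common to different instances. Result By fusing semantic and apparent features and learning common semantics across images, the proposed method achieves foreground segmentation performance on the PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) 2012 training and validation sets that clearly exceeds class activation mapping (CAM) and discriminative regional feature integration (DRFI), improving the F-measure by 3.5% and 3.4%, respectively. Conclusion The proposed method can take any semantic-feature or apparent-feature foreground computation module as a basic unit and optimizes the fusion of the two strategies, achieving better foreground segmentation performance.
Keywords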
Semantic-apparent feature-fusion-based unsupervised foreground segmentation method
Li Xi1, Ma Huimin2, Ma Hongbing1,3, Wang Yidong1
1. Tsinghua University, Beijing 100084, China; 2. University of Science and Technology Beijing, Beijing 100083, China; 3. Xinjiang University, Urumqi 830046, China
Abstract
Objective Foreground segmentation is an essential research topic in image understanding and serves as a pre-processing step for salient object detection, semantic segmentation, and various pixel-level learning tasks. Given an image, the task is to label each pixel as foreground or background. Fully supervised methods can achieve satisfactory results through multi-instance learning. Under unsupervised conditions, however, stable segmentation performance is difficult to achieve with fixed rules or a single type of feature because different images and instances vary widely in appearance. Moreover, different types of methods have complementary strengths and weaknesses. On the one hand, semantic feature-based learning methods extract the key regions of the foreground accurately but cannot recover complete object regions or detailed edges. On the other hand, apparent feature-based frameworks provide richer detail but do not adapt to the variety of instances and imaging conditions. Method Based on these observations, we propose an unsupervised foreground segmentation method based on semantic-apparent feature fusion. First, given a sample, we encode it as semantic and apparent feature maps. We use a class activation mapping model pretrained on ImageNet to generate the semantic heat map and use saliency and edge maps to express the apparent features. Any semantic or apparent feature can be plugged in, so the framework adapts to a wide range of cases. Second, to combine the advantages of the two types of features, we split the image into superpixels and describe each superpixel with four elements, namely the unary and binary semantic and apparent features, which gives a comprehensive description of the two types of expression.
Specifically, we build two binary relation matrices, one based on the apparent features and one based on the semantic features, to measure the similarity of each pair of superpixels. To generate the binary semantic feature of a superpixel, we weight the semantic features by the apparent feature-based similarity; conversely, the semantic feature-based similarity is used as the weight when computing the binary apparent feature. By encoding the features from these two views, the two types of information are fused for the first time. Then, we propose an adaptive parameter learning method that calculates the most suitable feature weights and generates the foreground confidence score map. Based on the four elements, we express the foreground confidence score of each superpixel as a linear equation whose parameters are estimated by the least squares method. For each image, we first select the superpixels whose unary semantic and apparent features indicate foreground or background with high confidence. Then, we learn the weights of the four elements and the bias of their linear combination by least squares estimation. With these adaptive parameters, we can infer a better confidence score for each superpixel individually. Third, we use a segmentation network to learn the common foreground features of different instances. In weakly supervised semantic segmentation, a fully supervised framework is used to improve the pseudo annotations of the training data and to provide inference results. Inspired by this idea, we use a convolutional network to mine the common foreground features of different instances. The trained model can be used directly to improve the foreground segmentation of both the images used for network training and new data. Better performance is achieved by fusing semantic and apparent features and by cascading the intra-image adaptive feature weight learning module with the inter-image common feature learning module.
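The cross-weighted binary features and the intra-image least squares weight learning described in the Method paragraph can be illustrated with a minimal sketch. It is not the authors' implementation. The Gaussian similarity on unary scores, the row normalization of the relation matrices, the seed thresholds lo and hi, the clipping of scores to [0, 1], the toy scores, and all function names are assumptions introduced here for illustration only.

```python
import numpy as np


def gaussian_similarity(values, sigma=0.2):
    # Hypothetical pairwise similarity: Gaussian kernel on score differences.
    diff = values[:, None] - values[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))


def binary_features(unary_sem, unary_app, sim_sem, sim_app):
    # Row-normalize so each superpixel's context weights sum to 1 (assumption).
    w_sem = sim_sem / sim_sem.sum(axis=1, keepdims=True)
    w_app = sim_app / sim_app.sum(axis=1, keepdims=True)
    # Binary semantic feature: semantic scores weighted by apparent similarity.
    bin_sem = w_app @ unary_sem
    # Binary apparent feature: apparent scores weighted by semantic similarity.
    bin_app = w_sem @ unary_app
    return bin_sem, bin_app


def select_seeds(unary_sem, unary_app, lo=0.2, hi=0.8):
    # Confident superpixels: both unary cues agree (thresholds are assumed).
    fg = np.flatnonzero((unary_sem >= hi) & (unary_app >= hi))
    bg = np.flatnonzero((unary_sem <= lo) & (unary_app <= lo))
    return fg, bg


def fit_confidence(features, fg, bg):
    # Targets: 1 for foreground seeds, 0 for background seeds.
    idx = np.concatenate([fg, bg])
    targets = np.concatenate([np.ones(len(fg)), np.zeros(len(bg))])
    # Append a constant column so the fit includes a bias term.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    # Least squares estimate of the four weights and the bias on the seeds only.
    params, *_ = np.linalg.lstsq(design[idx], targets, rcond=None)
    # Apply the per-image parameters to every superpixel.
    scores = np.clip(design @ params, 0.0, 1.0)
    return scores, params


# Toy unary scores for 10 superpixels (made-up values, not experimental data).
unary_sem = np.array([0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.15, 0.10, 0.05, 0.12])
unary_app = np.array([0.90, 0.85, 0.92, 0.30, 0.80, 0.50, 0.10, 0.18, 0.08, 0.60])

sim_sem = gaussian_similarity(unary_sem)
sim_app = gaussian_similarity(unary_app)
bin_sem, bin_app = binary_features(unary_sem, unary_app, sim_sem, sim_app)

features = np.column_stack([unary_sem, unary_app, bin_sem, bin_app])
fg, bg = select_seeds(unary_sem, unary_app)
scores, params = fit_confidence(features, fg, bg)
print(np.round(scores, 3))
```

Because the parameters are re-estimated from the seeds of each image, the relative influence of the four elements can differ from image to image, which is the point of the adaptive step.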
Result We test our method on the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) 2012 training and validation sets, which include 10 582 and 1 449 samples, respectively. The precision-recall curve and the F-measure are used as indicators to evaluate the experimental results. Compared with typical semantic and apparent feature-based foreground segmentation methods, the proposed framework clearly improves on the baselines. The F-measure improves by 3.5% on the PASCAL VOC 2012 training set and by 3.4% on the validation set. We also examine visualized results to analyze the advantages of the fusion framework. The comparison shows that the adaptive feature fusion operation yields accurate, detailed results, while incorrect cases can be further corrected by the multi-instance learning framework. Conclusion In this study, we propose a semantic-apparent feature fusion method for unsupervised foreground segmentation. Given an input image, we first calculate the unary semantic and apparent features of each superpixel region. Then, we integrate the two types of features through the cross-use of the apparent and semantic similarity measures. Next, we establish a context relationship for each pair of superpixels to calculate the binary features of each region. Further, we establish an adaptive weight learning strategy. By automatically adjusting the influence of each feature dimension on the foreground estimate for each specific image, we obtain the weighting parameters for optimal foreground segmentation and the foreground confidence of the image. Finally, we build a foreground segmentation network to learn the foreground features common to different instances and samples.
Using the trained network model, the image can be re-inferred to obtain more accurate foreground segmentation results. Experiments on the PASCAL VOC 2012 training and validation sets demonstrate the effectiveness and generalization ability of the algorithm. Moreover, the proposed method can take other foreground segmentation methods as its baseline and can be widely applied to improve tasks such as foreground segmentation and weakly supervised semantic segmentation. We also believe that fusing more types of semantic and apparent features, and alternately iterating between mining the spatial context within an image and the features common to different instances, is a feasible way to further improve foreground segmentation and an important idea for semantic segmentation tasks.
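For reference, the precision-recall and F-measure indicators named in the Result paragraph can be computed from a foreground confidence map roughly as follows. This is a sketch under assumptions: the abstract does not state the weighting of precision against recall, so beta2 is left as a parameter (beta2 = 1 gives the plain harmonic mean), and the function names and the 2 x 2 toy map are hypothetical.

```python
import numpy as np


def precision_recall_f(pred, gt, beta2=1.0):
    # pred, gt: binary foreground masks of the same shape.
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta2 * precision + recall
    # F = (1 + beta2) * P * R / (beta2 * P + R); beta2 is an assumed parameter.
    f = (1.0 + beta2) * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f)


def pr_curve(conf, gt, thresholds):
    # One (precision, recall) point per binarization threshold.
    conf = np.asarray(conf, dtype=float)
    return [precision_recall_f(conf >= t, gt)[:2] for t in thresholds]


# Toy 2 x 2 confidence map and ground truth (made-up values).
conf = np.array([[0.9, 0.8], [0.4, 0.1]])
gt = np.array([[1, 0], [1, 0]])

p, r, f = precision_recall_f(conf >= 0.5, gt)
curve = pr_curve(conf, gt, [0.3, 0.5])
print(p, r, f)
print(curve)
```

Sweeping the threshold over the confidence map traces the precision-recall curve, while a single F-measure summarizes one operating point.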
Keywords
computer vision; foreground segmentation; unsupervised learning; semantic-apparent feature fusion; natural scene images; PASCAL VOC dataset; adaptive weighting