Semantic segmentation combining context features with multi-layer CNN feature fusion
Abstract
Objective Region-based semantic segmentation methods tend to lose detail information, which yields coarse, low-accuracy segmentation results. To address this problem, a semantic segmentation method combining context features with multi-layer convolutional neural network (CNN) feature fusion is proposed. Method First, the selective search method is used to generate candidate regions of different scales from the image and obtain region feature masks. Second, a CNN is used to extract the features of each region, and high-level and low-level features are fused in parallel. Because the feature maps extracted by different layers differ in size, the RefineNet model is used to fuse feature maps of different resolutions. Finally, the region feature masks and the fused feature maps are fed into a free-form region-of-interest pooling layer, and pixel-level classification labels are obtained through a softmax classification layer. Result With the fusion of context features and multi-layer CNN features as the basic framework, the algorithm achieves good performance. The experiments cover CNN multi-layer feature fusion, the combination of background information with fused features, and an analysis of how the dropout rate affects the results. Tested on the Siftflow dataset, the method reaches a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with current region-based end-to-end semantic segmentation models, pixel accuracy is improved by 10.6% and average accuracy by 0.6%. Conclusion The proposed algorithm combines the foreground and context information of regions to make full use of regional context, applies dropout to reduce the number of network parameters and avoid over-fitting, and uses the RefineNet model to fuse multi-layer CNN features, effectively exploiting multi-layer image detail for segmentation. The model's ability to discriminate small objects within regions is enhanced, and good segmentation results are obtained for images with occlusion and complex backgrounds.
Keywords
Semantic segmentation method combining context features with CNN multi-layer features
Luo Huilan, Zhang Yun (School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China)
Abstract
Objective Semantic segmentation plays an increasingly important role in visual analysis. It combines image classification, object detection, and image segmentation, and classifies the pixels in an image. Semantic segmentation divides an image into regions with specific semantic meanings and identifies the semantic category of each region block. A semantic inference process from low to high levels is thus realized, and a segmented image with pixel-by-pixel semantic annotation is obtained. Semantic segmentation methods based on candidate regions extract free-form regions from the image, describe their features, classify them region by region, and convert the region-based predictions into pixel-level predictions. Although candidate-region-based models have contributed to the development of semantic segmentation, they need to generate many candidate regions, a process that requires a large amount of time and memory. In addition, the quality of the candidate regions extracted by different algorithms, and the lack of spatial information in the candidate regions, especially the loss of information on small objects, directly affect the final segmentation. To solve the problem of rough segmentation results and low accuracy of region-based semantic segmentation methods caused by the lack of detail information, a semantic segmentation method that fuses context features with multi-layer convolutional neural network features is proposed in this study. Method First, candidate regions of different scales are generated from an image by using the selective search method. Each candidate region consists of three parts: a square bounding box, a foreground mask, and the foreground size. The foreground mask is a binary mask that covers the foreground of the candidate region. Multiplying the square region features on each channel by the corresponding foreground mask yields the foreground features of the region.
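The masking step above can be sketched in a few lines of numpy; the shapes below are illustrative toy values, not the paper's actual feature dimensions:

```python
import numpy as np

def apply_foreground_mask(region_features, foreground_mask):
    """Multiply each feature channel by the binary foreground mask,
    keeping only the responses that fall inside the region's foreground."""
    # region_features: C x H x W, foreground_mask: H x W with values in {0, 1}
    return region_features * foreground_mask[np.newaxis, :, :]

features = np.random.rand(8, 7, 7)            # 8 channels over a toy 7x7 region
mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1                            # toy foreground mask
fg = apply_foreground_mask(features, mask)    # background positions become 0
```

Broadcasting the mask over the channel axis keeps the operation a single element-wise multiply, so the cost is linear in the number of feature values.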
Selective search uses graph-based image segmentation to generate several sub-regions, iteratively merges them according to the similarity between sub-regions (i.e., color, texture, size, and spatial overlap), and outputs all possible regions of the target. Second, a convolutional neural network is used to extract the features of each region, and the high-level and low-level features are fused in parallel. Parallel fusion combines features of the same dataset according to a certain rule, and the feature dimensions must match before the combination. Because the feature maps extracted from different layers differ in size, the features obtained by each convolutional layer are reduced using linear discriminant analysis (LDA). By selecting a projection hyperplane in the multi-dimensional space, LDA makes projections of the same category onto the hyperplane closer than projections of different categories. The output dimension of LDA depends only on the number of categories and is independent of the dimension of the data. The image dataset used in this work contains 33 categories, so LDA is used to reduce the feature dimension to 32, which decreases the number of network parameters. Moreover, as a supervised algorithm, LDA can make good use of prior class knowledge. Experimental results show that dimension reduction may lose some feature information but does not affect the segmentation result. After dimension reduction, the distance between different categories may increase and the distance within the same category may decrease, which makes the classification task easier. The RefineNet model is then used to fuse feature maps with different resolutions.
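The LDA reduction described above can be sketched with scikit-learn; the sample counts and input dimension below are toy stand-ins, but the class count (33) and target dimension (32) match the text, since LDA can project to at most one fewer dimension than the number of classes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative stand-in data: 330 region feature vectors of dimension 64,
# 10 per class for 33 classes (sizes are toy values, not the paper's).
rng = np.random.default_rng(0)
X = rng.normal(size=(330, 64))
y = np.repeat(np.arange(33), 10)

# LDA can project to at most (n_classes - 1) = 32 dimensions,
# matching the 32-dimensional features described above.
lda = LinearDiscriminantAnalysis(n_components=32)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (330, 32)
```

Because the projection is fit with the class labels, the reduced features tend to separate the 33 categories better than an unsupervised reduction such as PCA at the same dimensionality.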
In this work, feature maps at five resolutions are fused. The RefineNet network consists of three main components: adaptive convolution, multi-resolution fusion, and chained residual pooling. The multi-resolution fusion component adapts the input feature maps with a convolution layer, upsamples them, and performs pixel-level addition. Its main task is to counteract the information loss caused by the downsampling operations, so that the image features extracted by each layer can contribute to the final segmentation network. Finally, the region feature mask and the fused feature map are fed into a free-form region-of-interest pooling layer, and the pixel-level classification labels of the image are obtained through a softmax classification layer. Result Context features and multi-layer convolutional neural network (CNN) features are fused for semantic segmentation, which exhibits good performance. The experiments mainly cover CNN multi-layer feature fusion, the combination of background information with fused features, and the influence of the dropout rate on the results. The trained model is tested on the Siftflow dataset, reaching a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with the current region-based, end-to-end semantic segmentation model, pixel accuracy is increased by 10.6% and average accuracy by 0.6%. Conclusion A semantic segmentation algorithm that combines context features with multi-layer CNN features is proposed in this study. The proposed method combines the foreground and context information of regions to exploit regional context, employs dropout to reduce the number of network parameters and avoid over-fitting, and uses the RefineNet model to fuse the multi-layer features of the CNN.
By effectively using the multi-layer detail information of the image for segmentation, the model's capability to discriminate small objects within regions is enhanced, and the segmentation of images with occlusion and complex backgrounds is improved. The experimental results show that the proposed method achieves better segmentation quality, better overall performance, and higher robustness than several state-of-the-art methods.
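The multi-resolution fusion step described in the Method section can be sketched in numpy. This toy version upsamples each feature map to the largest resolution and adds them pixel-wise; RefineNet's adaptive convolutions and chained residual pooling are omitted, and the nearest-neighbour upsampling and shapes are illustrative assumptions:

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a C x H x W feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_multi_resolution(feature_maps):
    """Upsample every map to the largest spatial size, then add pixel-wise.
    Assumes square maps whose sizes divide the largest one, and that the
    first map in the list has the largest resolution."""
    c, h, w = feature_maps[0].shape
    fused = np.zeros((c, h, w))
    for f in feature_maps:
        fused += upsample_nearest(f, h // f.shape[1])
    return fused

# Toy pyramid with three of the five resolutions used in the paper, 4 channels each.
maps = [np.random.rand(4, 8, 8), np.random.rand(4, 4, 4), np.random.rand(4, 2, 2)]
fused = fuse_multi_resolution(maps)
print(fused.shape)  # (4, 8, 8)
```

Pixel-wise addition requires all maps to share channel count and spatial size, which is exactly why RefineNet first adapts each input with a convolution before upsampling and summing.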
Keywords
semantic segmentation; convolutional neural network (CNN); feature fusion; selective search; RefineNet model