Object extraction from remote sensing imagery combining a double-vision fully convolutional network

Li Daoji, Guo Haitao, Zhang Baoming, Zhao Chuan, Lu Jun, Yu Donghang (PLA Strategic Support Force Information Engineering University, Zhengzhou 450000, China)

Abstract
Objective Object extraction from remote sensing imagery is a research hotspot in the remote sensing field. Because backgrounds and object types are complex and diverse, traditional methods alone can hardly distinguish and identify object categories accurately, which often leads to false and missed extractions. Methods based on convolutional neural networks (CNNs) generally outperform traditional methods in object extraction, but they require a large amount of training time and may even converge slowly or fail to converge. Therefore, based on the complementarity of multi-vision feature information, a double-vision fully convolutional network is proposed. Method The network uses VGG (visual geometry group) 16 and AlexNet to extract local and global visual features, respectively, and processes the two kinds of features through a fusion network to fully exploit the complementary information they contain. The local feature extraction network serves as the main network to reduce computational complexity, while the global feature extraction network serves as an auxiliary network to raise prediction confidence, accelerate convergence, and shorten training time. Result Experiments were conducted on public building and road datasets, with comparisons against U-Net, which performs excellently in binary classification, and the lightweight Mnih network. The results show that the average convergence time of the proposed double-vision fully convolutional network is only 15.46% of that of U-Net; its extraction accuracy is comparable to that of U-Net and far higher than that of Mnih; and at the 95% confidence level, its confidence interval is clearly better than that of U-Net. Conclusion The proposed double-vision fully convolutional network fuses the local detail features and global features of objects in the image, maintains high extraction accuracy and confidence, and is easier to train and converge, providing a reference for subsequent object extraction from remote sensing imagery and for neural network design.
Keywords
Double vision full convolution network for object extraction in remote sensing imagery

Li Daoji, Guo Haitao, Zhang Baoming, Zhao Chuan, Lu Jun, Yu Donghang(PLA Strategic Support Force Information Engineering University, Zhengzhou 450000, China)

Abstract
Objective Object extraction is a fundamental task in remote sensing. The accurate extraction of ground objects, such as buildings and roads, benefits change detection, geographic database updating, land-use analysis, and disaster relief. Many methods for extracting objects such as roads and buildings have been proposed over the past years. Some of these methods are based on the geometric features of objects, such as lines and line intersections. Most traditional approaches can obtain satisfactory results with high identification and positional accuracy in rural and suburban areas, but their accuracy is low in complex urban areas. With the rise of deep learning and computer vision technology, a growing number of researchers have attempted to solve the related problems through deep learning methods, which have been shown to greatly improve the precision of object extraction. However, due to memory capacity limitations, most of these deep learning methods are patch-based. This operation cannot fully utilize the contextual information. At the edge of a patch, the prediction confidence is much lower than in the central region because relevant context is missing. Therefore, additional epochs are needed for feature extraction and training. In addition, objects often appear at extremely different scales in remote sensing images; thus, determining the right size of the vision area or the sliding window is difficult. Using larger patches to predict smaller central labels is an effective solution. In this manner, the confidence of the predicted label map is greatly increased, and the network is easier to train and converges faster. Method This study proposes a novel network architecture called the double-vision full convolution network (DVFCN). The architecture mainly includes three parts: an encoder part of local vision (ELV), an encoder part of global vision (EGV), and a fusion decoding part (FD).
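The larger-patch strategy mentioned above, predicting only the central label region of a bigger input patch so that every labeled pixel has surrounding context, can be sketched as follows. The patch and label sizes here are illustrative assumptions, not values from the paper.

```python
def crop_center(patch, label_size):
    """Crop the central label_size x label_size window from a square patch.

    The margin around the label supplies context for the label's border
    pixels, which is why their prediction confidence increases.
    """
    patch_size = len(patch)
    margin = (patch_size - label_size) // 2
    return [row[margin:margin + label_size]
            for row in patch[margin:margin + label_size]]

# Illustrative sizes (assumed): a 6x6 patch predicting a 2x2 central
# label leaves a 2-pixel context margin on every side.
patch = [[r * 6 + c for c in range(6)] for r in range(6)]
label = crop_center(patch, 2)
```

In practice the same cropping is applied to the network's output map, so each small label is predicted from a strictly larger receptive field.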
The ELV extracts the detailed features of buildings, while the EGV provides confidence over a larger field of view. The FD restores the feature maps to the original patch size. Visual geometry group (VGG) 16 and AlexNet are applied as the backbones of the encoder networks in the ELV and EGV, respectively. To combine the information of the two pathways, the feature maps are concatenated and fed into the FD. After the last level of the FD, a smoothing layer and a sigmoid activation layer improve the feature-processing ability and project the multichannel feature maps into the desired segmentation. Finally, skip connections are applied to the DVFCN structure so that finer low-level details can compensate high-level semantic features. The model was trained on an NVIDIA GTX 1080 Ti GPU with 11 GB of onboard memory. The loss is minimized by the Adam optimizer with mini-batches of size 16, an initial learning rate of 0.001, and an L2 weight decay of 0.000 5. The learning rate is multiplied by 0.5 every 10 epochs. Result To verify the effectiveness of DVFCN, we conducted experiments on two public datasets: a European building dataset and the Massachusetts road dataset. In addition, two variants of DVFCN were tested, and U-Net and Mnih were run for comparison. To comprehensively evaluate the classification performance of the models, we plotted receiver operating characteristic (ROC) curves and precision-recall curves, taking the area under the ROC curve (AUC) and the F1 score as evaluation metrics. The experimental results show that DVFCN and U-Net achieve almost the same superior classification performance, yet the total training time of DVFCN was only 15.46% of that of U-Net. The AUC values of U-Net on the building and road datasets were 0.965 3 and 0.983 7, only 0.002 1 and 0.005 5 higher than those of DVFCN, respectively. The extraction results for roads and buildings were better than those of Mnih.
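The training schedule stated above (initial learning rate 0.001, multiplied by 0.5 every 10 epochs) amounts to a simple step decay. A minimal sketch follows; the function name is illustrative, not from the paper.

```python
def step_decay_lr(epoch, base_lr=0.001, factor=0.5, step=10):
    """Learning rate at a given epoch under step decay.

    Matches the schedule in the text: start at base_lr and multiply
    by `factor` once every `step` epochs.
    """
    return base_lr * factor ** (epoch // step)

# Epochs 0-9 train at 0.001, epochs 10-19 at 0.000 5, and so on.
lr_epoch_12 = step_decay_lr(12)
```

With a framework optimizer this would typically be expressed as a step scheduler (e.g. a gamma of 0.5 and a step size of 10), applied alongside the Adam settings given above.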
In addition, the confidence rates of the two networks were calculated. The experimental results show that the confidence interval of DVFCN is better than that of U-Net at the 95% confidence level. The importance of the ELV and EGV was also studied. The results show that the ELV is more important than the EGV because it provides more detailed local information. The EGV performs poorly by itself because it can only provide global information; however, this global information is important for the convergence of DVFCN. Conclusion The DVFCN is proposed for object extraction from remote sensing imagery. The proposed network achieves nearly the same extraction performance as U-Net, but with a much shorter training time and higher confidence. In addition, DVFCN provides a new full convolution network architecture that combines local and global information from different fields of view. The proposed model can be further improved, and a more effective method of combining local and global context information will be developed in future work; in particular, studying how to better utilize global information remains important.
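The 95% confidence-interval comparison described above can be reproduced with a standard normal approximation (z = 1.96 at the 95% level). A small sketch, where the sample scores are made-up illustrative values, not results from the paper:

```python
import math
import statistics

def confidence_interval(samples, z=1.96):
    """Two-sided confidence interval for the mean of `samples` under a
    normal approximation; z = 1.96 corresponds to the 95% level."""
    mean = statistics.mean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * sem, mean + z * sem

# Hypothetical per-patch confidence scores for one network.
scores = [0.91, 0.93, 0.90, 0.94, 0.92]
low, high = confidence_interval(scores)
```

A narrower interval at the same level indicates more consistent per-patch confidence, which is the sense in which DVFCN is compared against U-Net above.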
Keywords
