Landmark recognition based on additive angular margin loss and multi-feature fusion
Abstract
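Objective Landmark recognition is an applied problem in image processing and computer vision. To address the weaknesses of single features in landmark recognition, namely that global features are sensitive to viewpoint changes and local features are sensitive to illumination changes, we propose a weakly supervised landmark recognition model that is trained with the additive angular margin loss (ArcFace loss) and fuses multiple features. Method Recognition is performed by image retrieval, taking the Top-1 result. We first prove the range from which the ArcFace loss parameters should be chosen and use that range to guide parameter selection during training; we then use an effective method for fusing local and global features to obtain the image descriptors used for retrieval. Model training has two steps: first, the ImageNet-pretrained weights are fine-tuned on the Google landmark dataset with the ArcFace loss; second, an attention mechanism is added and the attention network is trained. Inference has three parts: extracting global features, obtaining local features, and fusing the features. Specifically, for an input query image, the global feature is first extracted from the feature embedding layer of the fine-tuned convolutional neural network; local features are then extracted with the attention mechanism at an intermediate layer of the network; finally, the two feature vectors are concatenated horizontally, and image retrieval returns the database result most similar to the query image. Result Experiments on the Paris and Oxford buildings datasets show that feature fusion allows a shallow network to match the performance of a deep pretrained network, and that the fused features improve mean average precision (mAP) by about 1% over the global features alone. The experiments also show that no additional feature whitening step is needed on the neural network embedding features. Finally, the model also achieves fairly satisfactory results on city-scale street-view images. Conclusion The model is trained with the ArcFace loss and makes the similarity results of multiple features complement each other effectively, which improves its robustness to interference in practical application scenarios.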
Keywords
Landmark recognition based on ArcFace loss and multiple feature fusion
Mao Xueyu1,2, Peng Yanbing2
(1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China; 2. Nanjing FiberHome World Communication Technology Co., Ltd., Nanjing 210019, China)
Abstract
Objective Landmark recognition, which is a new application in computer vision, has been increasingly investigated in the past several years and is widely used to add landmark recognition to image retrieval systems. However, several problems remain unsolved: global features are sensitive to viewpoint changes, and local features are sensitive to illumination changes. Most existing methods use a convolutional neural network (CNN) to extract image features in place of traditional feature extraction methods such as scale-invariant feature transform (SIFT) or speeded up robust features (SURF). At present, the best model is deep local feature (DeLF), but its retrieval requires a combination of product quantization (PQ) and K-dimensional (KD) trees. This process is complex and consumes approximately 6 GB of GPU memory, which makes it unsuitable for rapid deployment and use; its most time-consuming step is random sample consensus. Method Because single features have these weaknesses, a multiple feature fusion method is needed: multiple features can be concatenated horizontally into a single vector to improve on the performance of CNN global features. For large-scale landmark data, manual labeling of images is time consuming and laborious, and the labels are subject to human cognitive bias. To minimize manual labeling, a weakly supervised loss, the additive angular margin loss function (ArcFace loss function), is used to train the model with image-level annotations. This loss improves on the standard cross-entropy loss by replacing Euclidean distances with distances in the angular domain. The ArcFace loss function performs well in facial recognition and image classification and is easy to use in other deep learning applications. This paper gives the range of values for the parameters of the ArcFace loss function, together with its proof.
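As an illustration of the loss described above, the following is a minimal sketch of the ArcFace loss for a single sample, written with the Python standard library only. The function names, the scale s = 64 and the margin m = 0.5 are assumptions for illustration (they are commonly used ArcFace settings, not the parameter range derived in this paper), and the sketch omits the special handling that practical implementations apply when the angle plus the margin exceeds π.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_loss(embedding, class_weights, label, s=64.0, m=0.5):
    """ArcFace loss for one sample (illustrative sketch).

    s and m are hypothetical defaults, not values from the paper.
    """
    x = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        w = l2_normalize(w)
        # Cosine of the angle between the embedding and class centre j.
        cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, w))))
        if j == label:
            # Add the angular margin to the ground-truth class only.
            cos_theta = math.cos(math.acos(cos_theta) + m)
        logits.append(s * cos_theta)
    # Numerically stable softmax cross-entropy over the scaled cosines.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[label]

if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(arcface_loss([1.0, 1.0], weights, label=0, s=4.0, m=0.0))
    print(arcface_loss([1.0, 1.0], weights, label=0, s=4.0, m=0.5))
```

With m = 0 the function reduces to softmax cross-entropy on scaled cosine similarities; a positive margin raises the loss for the same input, which is what pushes embeddings of the same landmark closer together in angle.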
Thus, a weakly supervised recognition model based on ArcFace loss and multiple feature fusion is proposed for landmark recognition. The proposed model uses ResNet50 as its trunk and is trained in two steps: fine-tuning the trunk and training the attention layer. Fine-tuning uses the Google landmark image dataset and starts from weights pretrained on the ImageNet dataset. The average pooling layer is replaced by a generalized mean (GeM) pooling layer because GeM pooling has proven useful in image retrieval. The attention mechanism is built from two convolutional layers with 1×1 kernels and is trained to focus on the local features that are needed. Image preprocessing is required before training and consists of three stages: center crop, resize, and random crop. People usually prefer to place buildings and themselves in the center of images, so center cropping is suitable and avoids the problems that arise with padding or resizing. The proposed model uses classification training to complete the image retrieval task. The final input image size is set to 448×448 pixels. This value is a compromise, because the input image size in image retrieval is usually 800×800 to 1 500×1 500 pixels, whereas the size used in classification is 224×224 to 300×300 pixels. For training, the image is first center cropped and then resized to 500×500 pixels, so that random cropping can be used for data augmentation. For inference, the image is center cropped and directly resized to 448×448 pixels, so only two processing steps are needed. The inference of this model is divided into three parts, namely, extracting global features, obtaining local features, and feature fusion.
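The generalized mean pooling mentioned above can be sketched as follows. This is a minimal standard-library illustration, not the paper's implementation: the function name, the flat per-channel list layout, the exponent p = 3 (a commonly used starting value) and the clamping constant eps are all assumptions.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean (GeM) pooling over spatial positions (sketch).

    feature_map: list of channels, each a flat list of spatial activations.
    p and eps are hypothetical defaults, not values from the paper.
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to a small positive value so the power is well defined.
        clamped = [max(v, eps) for v in channel]
        mean_pow = sum(v ** p for v in clamped) / len(clamped)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

if __name__ == "__main__":
    fmap = [[1.0, 3.0], [2.0, 2.0]]
    print(gem_pool(fmap, p=1.0))   # p = 1 is ordinary average pooling
    print(gem_pool(fmap, p=50.0))  # large p approaches max pooling
```

The exponent interpolates between average pooling (p = 1) and max pooling (p → ∞), which is why GeM can emphasize strong local responses without discarding the rest of the feature map.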
For the input query image, the global feature is first extracted from the embedding layer of the CNN fine-tuned with the ArcFace loss function; second, the attention mechanism is used to obtain local features in the middle layer of the network, and only local features that exceed a threshold are kept as useful; finally, the two features are fused, and the results in the database that are most similar to the current query image are obtained through image retrieval. Result We compared the proposed model with several state-of-the-art models, including traditional approaches and deep learning methods, on two public revisited datasets, namely, the Oxford and Paris building datasets. The two datasets were reconstructed in 2018 and are divided into three difficulty levels, namely, easy, medium, and hard. Three groups of comparisons are used in the experiment, all on the revisited Oxford and Paris datasets. The first group compares the proposed model's performance with that of other models, such as HesAff-rSIFT-VLAD and VggNet-NetVLAD. The second group compares the performance of a single global feature with that of the fused features. The last group compares the results obtained by whitening the features extracted at different layers of the proposed model. Results show that the feature fusion method allows a shallow network to achieve the effect of a deep pretrained network, and the mean average precision (mAP) increases by approximately 1% compared with the global features on the two previously mentioned datasets. The proposed model also achieves satisfactory results on urban street-view images. Conclusion In this study, we proposed a composite model that contains a CNN, an attention model, and a fusion algorithm to fuse two types of features.
Experimental results show that the proposed model performs well, that the fusion algorithm improves its performance, and that its performance on urban street-view datasets supports the practical application value of the proposed model.
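The fusion and Top-1 retrieval steps summarized in the abstract can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the abstract only says the two feature vectors are concatenated horizontally, so the L2 normalization before and after concatenation, the cosine similarity score, and the names fuse, top1, landmark_a and landmark_b are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse(global_feat, local_feat):
    """Concatenate global and local descriptors into one vector.

    Normalizing each part first is an assumption, made so that neither
    descriptor dominates the similarity score.
    """
    return l2_normalize(l2_normalize(global_feat) + l2_normalize(local_feat))

def top1(query, database):
    """Return (label, score) of the database entry most similar to query.

    database: dict mapping a label to a fused, unit-length descriptor.
    The score is cosine similarity, which for unit vectors is a dot product.
    """
    best_label, best_score = None, -math.inf
    for label, vec in database.items():
        score = sum(a * b for a, b in zip(query, vec))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

if __name__ == "__main__":
    db = {
        "landmark_a": fuse([1.0, 0.0], [1.0, 0.0]),
        "landmark_b": fuse([0.0, 1.0], [0.0, 1.0]),
    }
    query = fuse([0.9, 0.1], [0.8, 0.2])
    print(top1(query, db))
```

A linear scan is enough for a small database; a larger one would typically use an approximate nearest-neighbor index, which is outside the scope of this sketch.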
Keywords
landmark recognition; additive angular margin loss function (ArcFace loss function); attention mechanism; multiple features fusion; convolutional neural network (CNN)