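Sketch-based image retrieval combining fine-grained features and a deep convolutional network
Abstract
Objective Traditional sketch-based image retrieval methods mainly focus on retrieving images of the same category and ignore the fine-grained features of sketches. To address this, we propose a new sketch-based image retrieval method that combines fine-grained features with a deep convolutional network, attending both to holistic matching through deep cross-domain learning and to fine-grained detail matching. Method First, a multi-channel hybrid convolutional neural network is constructed to process sketches and natural images differently; second, an attention model is added to the network to obtain fine-grained features; finally, the coarse and fine features are fused and a similarity measure is applied to obtain the retrieval results. Result Experiments were conducted on different datasets, comparing against five baseline methods: the traditional scale-invariant feature transform (SIFT) and histogram of oriented gradients (HOG), and the deep sketch models Deep SaN (sketch-a-net), Deep 3DS (sketch), and Deep TSN (triplet sketch net). Top-1 and Top-10 were selected as metrics. On the shoe dataset, the Top-1 accuracy of the proposed method improved by 12%; on the chair dataset, Top-1 accuracy improved by 11% and Top-10 by 3%. Compared with traditional sketch retrieval methods, the proposed method achieves higher accuracy. In the experiments, the proposed method retrieves the target image at the first rank for the vast majority of sketch queries, achieving instance-level sketch retrieval. Conclusion A new sketch-based image retrieval method is proposed, offering a new approach to cross-domain retrieval between sketches and natural images and performing instance-level sketch retrieval. Compared with existing methods, retrieval accuracy is clearly improved, demonstrating the feasibility of the proposed method.
Keywords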
Sketch-based image retrieval based on fine-grained feature and deep convolutional neural network
Li Zongmin1, Liu Xiuxiu1, Liu Yujie1, Li Hua2
(1. College of Computer and Communication Engineering, China University of Petroleum, Qingdao 266580, China; 2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
Abstract
Objective Content-based and text-based image retrieval have played a major role in practical computer vision applications. In several scenarios, however, retrieval becomes a problem when sample queries are unavailable or describing them with a keyword is difficult. Compared with text, sketches can intrinsically capture object appearance and structure. They are highly intuitive to humans and descriptive in nature, and they provide a convenient means to specify what an object looks like. As a query modality, they offer a degree of precision and flexibility that is missing in traditional text-based image retrieval. Closely correlated with the proliferation of touch-screen devices, sketch-based image retrieval (SBIR) has become an increasingly prominent research topic in recent years. Conventional SBIR principally focuses on retrieving images of the same category and disregards the fine-grained features of sketches. SBIR is also challenging because humans draw free-hand sketches without any reference and focus only on the salient object structures. Hence, the shapes and scales in sketches are usually distorted compared with those in natural images. To deal with this problem, studies have developed methods to bridge the domain gap between sketches and natural images for SBIR. These approaches can be roughly divided into hand-crafted and cross-domain deep learning-based methods. Hand-crafted methods generate approximate sketches by extracting edge or contour maps from natural images. Afterward, hand-crafted features are extracted from the sketches and from the edge maps of natural images, and these features are fed into "bag-of-words" methods to generate representations for SBIR. The major limitation of hand-crafted methods is that the domain gap between sketches and natural images cannot be well remedied, because matching edge maps to non-aligned sketches with large variations and ambiguity is difficult.
To address this problem, we propose a novel sketch-based image retrieval method based on fine-grained features and a deep convolutional neural network. This fine-grained SBIR (FG-SBIR) approach focuses not only on coarse holistic matching via a deep cross-domain model but also on explicitly accounting for fine-grained detail matching. The proposed deep convolutional neural network is designed specifically for sketch-based image retrieval. Method Most existing SBIR studies have focused on category-level sketch-to-photo retrieval. A bag-of-words representation combined with a form of edge detection from photo images is often employed to bridge the domain gap. Previous work that attempted to address the fine-grained SBIR problem is based on a deformable part-based model and graph matching. However, the definition of fine-grained in that work differs from ours: there, a sketch is considered a match to a photo if the objects depicted look similar. In addition, these hand-crafted feature-based approaches are inadequate in capturing the subtle intra-category and inter-instance differences, as demonstrated in our experiments. Our method proceeds as follows. First, we construct a multi-branch hybrid deep convolutional neural network that processes sketches and natural images differently. Three branches are used: one sketch branch and two natural image branches. The sketch branch has four convolutional and two pooling layers, whereas each natural image branch has five convolutional and two pooling layers. The additional convolutional layer yields more abstract natural image features, which addresses the inconsistency in abstraction level between the two domains. Using different branch designs reduces the domain difference. Second, we extract detail information by adding an attention model to the neural network. Most attention models learn an attention mask, which assigns different weights to different regions of an image.
Soft attention is the most commonly used model because it is differentiable and can thus be learned end-to-end with the rest of the network. Our attention model is specifically designed for FG-SBIR in that it is robust against spatial misalignment through a shortcut connection architecture. Third, we combine coarse and fine semantic information to achieve retrieval; fusing the two yields more robust features. Finally, we train the network with a deep triplet loss, which is defined using the max-margin framework. Result Experiments are conducted on two benchmark datasets, a shoe dataset and a chair dataset. We use two traditional hand-crafted feature-based models, namely, scale-invariant feature transform (SIFT) and histogram of oriented gradients (HOG), together with three other baseline models, namely, Deep SaN, Deep 3DS, and Deep TSN, which use deep features designed for sketches. We use the ratio of queries whose true match is correctly ranked within the Top-1 and Top-10 results as the evaluation metrics. We compare the performance of our full model with that of the five baselines. The results show that the proposed method obtains higher retrieval precision than the traditional methods. Our model performs the best overall in each metric and on both datasets. The improvement is particularly clear at Top-1: approximately 12% on the shoe dataset and approximately 11% on the chair dataset. Top-10 accuracy on the chair dataset also increases by approximately 3%. In other words, for the majority of queries the correct photo is returned as the first result, which is the goal of instance-level retrieval. Thus, the proposed model obtains good results in the FG-SBIR task. Conclusion The proposed sketch-based image retrieval method provides a new way of approaching the cross-domain retrieval of sketches and natural images. The sketch convolutional neural network obtains good results in sketch-based image retrieval.
This task is more challenging than the well-studied category-level SBIR task, but it is also more useful for commercial SBIR adoption. Achieving fine-grained retrieval across the sketch/image gap requires a deep network trained with triplet annotations. We demonstrate how to sidestep this requirement and still achieve good performance in this new and challenging task. By introducing attention modeling and the sketch convolutional neural network, the model can concentrate on the subtle differences between local regions of a sketch and photo images and compute deep features containing both fine-grained and high-level semantics. The proposed sketch neural network is suitable for FG-SBIR.
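The three ingredients named in the Method and Result parts (soft attention with a shortcut connection, a max-margin triplet loss, and Top-K accuracy) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The function names, the squared Euclidean distance, the margin value of 0.3, and the specific shortcut form f + a·f are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_shortcut(features, scores):
    """Soft attention over spatial locations plus a shortcut connection.

    features: one feature vector (list of floats) per spatial location.
    scores:   one unnormalized attention score per location (hypothetical;
              in a real network a learned layer would produce these).
    Returns f + a * f per location (assumed form), so the original
    features survive even where the attention weight is small.
    """
    weights = softmax(scores)
    return [[x + w * x for x in f] for f, w in zip(features, weights)]

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(sketch, pos, neg, margin=0.3):
    """Max-margin triplet loss: max(0, margin + d(s, p+) - d(s, p-)).

    The margin of 0.3 is an illustrative assumption.
    """
    return max(0.0, margin + sq_dist(sketch, pos) - sq_dist(sketch, neg))

def top_k_accuracy(sketch_feats, photo_feats, true_idx, k):
    """Fraction of sketches whose true photo ranks within the top k."""
    hits = 0
    for s, t in zip(sketch_feats, true_idx):
        ranked = sorted(range(len(photo_feats)),
                        key=lambda i: sq_dist(s, photo_feats[i]))
        if t in ranked[:k]:
            hits += 1
    return hits / len(sketch_feats)
```

With k = 1 and k = 10, `top_k_accuracy` corresponds to the Top-1 and Top-10 metrics reported above; the loss is zero once the matching photo is closer to the sketch than the non-matching photo by at least the margin.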
Keywords
sketch-based image retrieval (SBIR); convolutional neural network; attention model; fine-grained feature; feature fusion