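Fine-grained shoe image retrieval by part detection and semantic network
Abstract
Objective Fine-grained image retrieval is a current research focus in fine-grained image analysis and computer vision. Taking shoe images as an example, traditional methods extract only coarse-grained features and lack key semantic attributes, so they struggle to distinguish subtle differences between parts and cannot be used effectively for fine-grained retrieval. Because most shoe image retrieval relies on simple style categories, which leads to low retrieval efficiency, a fine-grained shoe image retrieval method combining part detection and a semantic network is proposed. Method Part detection is performed on the input query shoe image using the annotated shoe image training set. A semantic network is then trained on the part-detected shoe images and the defined semantic attributes to extract feature vectors of the query image and the training images, and principal component analysis is applied for dimensionality reduction. Metric learning is performed on the feature vectors of the query image and each candidate image in the training set, and retrieval results are output in descending order of matching degree. Result On the UT-Zap50K dataset, the method is compared with four methods that currently achieve good retrieval performance, and retrieval precision improves by nearly 6%. Compared with SHOE-CNN (semantic hierarchy of attribute convolutional neural network), a retrieval method for the same task, the proposed method also achieves higher retrieval accuracy. Conclusion To address the low accuracy of shoe image retrieval caused by traditional image features lacking fine visual description, a fine-grained shoe image retrieval method is proposed. It improves the precision and accuracy of shoe image retrieval and meets the needs of practical applications well.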
Fine-grained shoe image retrieval by part detection and semantic network
Chen Qian1, Liu Li1,2, Fu Xiaodong1,2, Liu Lijun1,2, Huang Qingsong1,2
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2. Computer Technology Application Key Lab of Yunnan Province, Kunming 650500, China
Abstract
Objective Fine-grained image retrieval is a major topic in fine-grained image analysis and computer vision. Traditional methods typically retrieve near-duplicate images through large-scale coarse-grained retrieval, and their precision is low. Fine-grained image retrieval, in contrast, must identify and retrieve images at the subclass level. Traditional image retrieval extracts only coarse-grained features, which cannot represent such images effectively, and it lacks key semantic attributes, which makes it difficult to distinguish subtle differences among parts. Images of the same subclass can also differ significantly in shape, pose, and color, so retrieval results often fail to meet practical needs. Compared with conventional image analysis problems, fine-grained image retrieval is more challenging because differences between subclasses are small while differences within a subclass are large. To address these problems, a fine-grained retrieval method for shoe images based on part detection and a semantic network is proposed.

Method First, part detection is performed on the query shoe image using an annotated training set of shoe images. Second, the semantic network is trained on the part-detected shoe images and the defined semantic attributes, and feature vectors of the query and training images are extracted. Third, principal component analysis is used for dimensionality reduction. Finally, metric learning is used to compute the similarity between the query image and each candidate image, and the retrieval results are output in descending order of matching degree. On the UT-Zap50K dataset, fine-grained attributes are defined for each part region of the shoe. A shape attribute with five attribute values is defined for the toe region. Two attributes, shape and height, with 13 attribute values are defined for the heel region. For the upper region, a height attribute with four attribute values and a closure attribute with nine attribute values are defined. Global attributes of the shoe, color and style, are defined with 20 attribute values.

Result On the UT-Zap50K dataset, the proposed method is compared with four methods that currently achieve good retrieval performance, and retrieval precision improves by nearly 6%. Compared with the semantic hierarchy of attribute convolutional neural network (SHOE-CNN), a retrieval method for the same task, the proposed method achieves higher retrieval accuracy. To illustrate its effectiveness, the proposed semantic network is compared with traditional GIST descriptor features, the linear support vector machine (LSVM) method, and deep learning methods, and performance is evaluated by top-K retrieval accuracy. The results show that the deep learning methods perform much better than the traditional GIST features and the LSVM method. By incorporating a metric learning algorithm, the proposed method achieves higher retrieval accuracy than the metric network and SHOE-CNN.

Conclusion A fine-grained shoe image retrieval method is proposed to address the low accuracy of shoe image retrieval caused by the lack of fine visual description in traditional image features. The method accurately detects the different parts of a shoe image, defines detailed semantic attributes for it, and obtains its visual attribute features by training the semantic network. This addresses the unsatisfactory retrieval accuracy that results from representing images with coarse-grained features only. Experimental results on the UT-Zap50K dataset show that the proposed method can retrieve images identical to the query image, with accuracy reaching 80% and 86% while maintaining running efficiency. The method still has shortcomings. On the one hand, detection accuracy is low for some images because shoe styles are numerous and complex. On the other hand, the prediction of some semantic attributes is inaccurate, and the defined fine-grained semantic attributes of shoe images are incomplete. Follow-up work will focus on these issues to improve retrieval accuracy and will extend the method to other application scenarios.
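The final step of the method, ranking candidate images by their matching degree with the query image, can be illustrated with a minimal sketch. The abstract does not specify the metric learning algorithm, so the sketch assumes a diagonal Mahalanobis-style distance whose per-dimension `weights` are given rather than learned. The function names, the image identifiers, and the three-dimensional feature vectors are all hypothetical and stand in for the PCA-reduced semantic features.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def weighted_distance(a: Sequence[float], b: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Diagonal Mahalanobis-style distance: sqrt(sum_i w_i * (a_i - b_i)^2).

    Assumption: `weights` stands in for a learned metric; here it is supplied
    by the caller, not learned from data.
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def rank_candidates(query: Sequence[float],
                    gallery: Dict[str, Sequence[float]],
                    weights: Sequence[float]) -> List[Tuple[str, float]]:
    """Return (image_id, distance) pairs sorted from best to worst match."""
    scored = [(image_id, weighted_distance(query, vec, weights))
              for image_id, vec in gallery.items()]
    # A smaller distance means a higher matching degree, so sort ascending.
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical reduced feature vectors for three candidate images.
    gallery = {
        "boot_01": [0.9, 0.1, 0.0],
        "sandal_07": [0.1, 0.8, 0.3],
        "boot_02": [0.8, 0.2, 0.1],
    }
    query = [1.0, 0.0, 0.0]
    uniform = [1.0, 1.0, 1.0]  # uniform weights reduce to Euclidean distance
    for image_id, dist in rank_candidates(query, gallery, uniform):
        print(f"{image_id}: {dist:.3f}")
```

With uniform weights the distance reduces to the Euclidean distance; a learned metric would replace `weights` (or generalize it to a full matrix) so that feature dimensions that separate shoe parts more reliably contribute more to the ranking.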
Keywords
fine-grained image retrieval; shoe image; part detection; semantic network; feature vector; metric learning
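The top-K retrieval accuracy used for evaluation in the abstract can be sketched as follows. The abstract does not give the exact definition, so this assumes one common form: a query counts as correct when its ground-truth label appears among the first K retrieved results. The function name and the toy labels are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(ranked_labels: Sequence[Sequence[str]],
                   true_labels: Sequence[str], k: int) -> float:
    """Fraction of queries whose true label appears in the first k results.

    ranked_labels[i] is the list of retrieved labels for query i, ordered
    from best to worst match; true_labels[i] is the ground truth for query i.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_labels) != len(true_labels):
        raise ValueError("one ranked list is required per query")
    if not true_labels:
        return 0.0
    hits = sum(1 for ranked, truth in zip(ranked_labels, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


if __name__ == "__main__":
    # Three hypothetical queries that all share the ground-truth label "a".
    ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
    truth = ["a", "a", "a"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(ranked, truth, k))
```

Under this definition the accuracy cannot decrease as K grows, so results for different K values are only comparable between methods at the same K.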