Fashion style retrieval based on deep multimodal fusion

Su Zhuo1,2, Ke Sibo1,2, Wang Ruomei1,2, Zhou Fan1,2 (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China; 2. National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou 510006, China)

Abstract
Objective Fashion retrieval is a research hotspot in computer vision and natural language processing, and it involves two query modalities: content-based and text-based. However, traditional retrieval methods usually suffer from low retrieval efficiency, and few studies pay attention to the similarity of clothing in style. To solve these problems, this paper proposes a fashion style retrieval method based on deep multimodal fusion. Method A hierarchical deep hash retrieval model is proposed: transfer learning is performed on a pre-trained residual network (ResNet), the classification layer is transformed into a hash coding layer, hash features are used for coarse retrieval, and deep image features are then used for fine retrieval. A text classification semantic retrieval model is designed: a text classification network based on long short-term memory (LSTM) classifies the query in advance to narrow the retrieval scope, and retrieval is then performed with text embedding semantic features extracted by doc2vec. A similar-style context retrieval model is also proposed, which measures the similarity of clothing styles by analogy with word similarity. Finally, a probability-driven method is used to quantify style similarity, and the result fusion strategy that maximizes this similarity is taken as the final output of the proposed retrieval method. Result On the Polyvore dataset, compared with the original ResNet model, the hierarchical deep hash retrieval model improves the top-5 mean retrieval precision by 11.6% and reduces the retrieval time by 2.57 s per query. Compared with the traditional text embedding model, the proposed text classification semantic retrieval model improves the top-5 precision by 29.96% and reduces the retrieval time by 16.53 s per query. Conclusion The proposed fashion style retrieval method based on deep multimodal fusion improves both retrieval accuracy and retrieval speed, and the retrieval of similar-style clothing makes the results more diverse.
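As an illustration of the hierarchical deep hash retrieval model described above, the following is a minimal PyTorch sketch. The backbone depth (ResNet-50), the 48-bit code length, the tanh relaxation with sign binarization, and the Hamming/Euclidean distances are assumptions made for the sketch and are not specified in the abstract.

```python
# Minimal sketch of the hierarchical deep hash retrieval idea: a pre-trained ResNet
# whose classification layer is replaced by a hash coding layer, with coarse retrieval
# on binary codes and fine re-ranking on deep features. Backbone depth, code length
# and distance choices are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from torchvision import models

class HashResNet(nn.Module):
    def __init__(self, n_bits=48):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])      # deep features
        self.hash_layer = nn.Sequential(nn.Linear(backbone.fc.in_features, n_bits),
                                        nn.Tanh())                          # relaxed hash codes

    def forward(self, x):
        deep = self.features(x).flatten(1)   # deep feature used for fine retrieval
        return deep, self.hash_layer(deep)   # (deep feature, relaxed hash code)

@torch.no_grad()
def retrieve(model, query, gallery, k_coarse=100, k_fine=5):
    """Coarse retrieval by Hamming distance on binarized codes, then fine re-ranking
    of the shortlisted items by Euclidean distance between deep features."""
    q_deep, q_code = model(query.unsqueeze(0))
    g_deep, g_code = model(gallery)
    hamming = (q_code.sign() != g_code.sign()).sum(dim=1)   # coarse distances
    shortlist = hamming.argsort()[:k_coarse]
    fine = torch.cdist(q_deep, g_deep[shortlist]).squeeze(0)
    return shortlist[fine.argsort()[:k_fine]]
```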
Keywords
Fashion style retrieval based on deep multimodal fusion

Su Zhuo1,2, Ke Sibo1,2, Wang Ruomei1,2, Zhou Fan1,2(1.School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China;2.National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou 510006, China)

Abstract
Objective Fashion retrieval is a research hotspot in computer vision and natural language processing. It aims to help users easily and quickly retrieve clothes that meet the query conditions from a large number of clothing items. To make retrieval more diverse and convenient, methods studied in recent years usually include an image query mode for intuitive retrieval and a text query mode for supplementary retrieval, that is, content-based image retrieval and text-based image retrieval. However, most of them focus on precise visual matching, and few pay attention to the similarity of clothing in style. In addition, the extracted feature dimensions are usually high, which leads to low retrieval efficiency. To solve these problems, we propose a fashion style retrieval method based on deep multimodal fusion. Method To address the low efficiency of the image query mode, a hierarchical deep hash retrieval model is first proposed. Its deep image feature extraction network performs transfer learning on a pre-trained residual network (ResNet), which learns deep image features at a lower cost, and the classification layer of the network is transformed into a hash coding layer that generates compact hash features. Hash features are used for coarse retrieval, and in the fine retrieval stage the preliminary results are re-ranked with the deep image features. To address the low efficiency of the text query mode and improve the scalability of the search engine, a text classification semantic retrieval model is proposed: a text classification network based on long short-term memory (LSTM) classifies the query text in advance, and a text embedding feature extraction model based on doc2vec then retrieves with text embedding features within the pre-classified category. At the same time, to capture the similarity of clothing styles, a similar-style context retrieval model is proposed. It measures style similarity by analogy with the part-of-speech and collocation-level similarity of words: following the training scheme of the word2vec model on text, garments are treated as words and outfits as sentences. Finally, a probability-driven method is used to quantify fashion style similarity without manual style annotation, and different multimodal fusion strategies are compared so that the final return of the search engine maximizes this similarity: clothing with a similar style context is retrieved on the basis of the text-modality results, and all modality results together with the style context results are re-ranked with image features. Result Polyvore is chosen as the dataset; the test set is used as queries and the retrieved training set items as results, which are evaluated with different indicators. For the image retrieval mode, compared with the original ResNet model, the hierarchical deep hash retrieval framework improves the top-5 mean retrieval precision by 11.6% and reduces the retrieval time by 2.57 s per query; the mean retrieval precision of the coarse-to-fine two-stage strategy is comparable to that of direct retrieval with deep image features. For the text retrieval mode, compared with the traditional text embedding model, the text classification semantic retrieval framework improves the top-5 precision by 29.96% and reduces the retrieval time by 16.53 s per query.
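The text branch described above can be sketched as follows, assuming a PyTorch LSTM classifier and a gensim doc2vec model. The hyper-parameters and helper names (TextClassifier, build_doc2vec, text_retrieve, item_category) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the text branch: an LSTM classifier first narrows the query to one
# clothing category, then doc2vec embeddings are matched only within that category.
# Library choices (PyTorch + gensim) and all hyper-parameters are assumptions.
import torch
import torch.nn as nn
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class TextClassifier(nn.Module):
    """LSTM-based network that predicts the clothing category of a query text."""
    def __init__(self, vocab_size, n_classes, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        _, (h, _) = self.lstm(self.embed(token_ids))
        return self.fc(h[-1])                       # class logits

def build_doc2vec(corpus_tokens):
    """Train doc2vec on the item descriptions (one TaggedDocument per item)."""
    docs = [TaggedDocument(words, [i]) for i, words in enumerate(corpus_tokens)]
    return Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)

def text_retrieve(classifier, d2v, query_tokens, query_ids, item_category, topn=5):
    """Classify first to shrink the search space, then rank by doc2vec similarity."""
    with torch.no_grad():
        category = classifier(query_ids.unsqueeze(0)).argmax(dim=1).item()
    q_vec = d2v.infer_vector(query_tokens)
    ranked = d2v.dv.most_similar([q_vec], topn=len(item_category))
    return [i for i, _ in ranked if item_category[i] == category][:topn]
```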
Finally, for the multimodal fusion results, clothing with a similar style context is retrieved on the basis of the text-modality results, and the final results are re-ranked in the image feature space; the average style similarity of the final results is 24%. Conclusion We propose a fashion style retrieval method based on deep multimodal fusion. Its hierarchical deep hash retrieval model serves as the image retrieval mode: compared with most other retrieval modes and methods, fine-tuning a pre-trained network with the goal of generating hash codes, together with the coarse-to-fine retrieval strategy, improves both retrieval accuracy and speed. As the text retrieval mode, the text classification semantic retrieval model uses the text classification network to narrow the retrieval scope and then retrieves with the text features extracted by the text feature extraction model, combining the outputs of the different models; compared with other text semantic retrieval methods, this mode also improves retrieval speed and accuracy. At the same time, to capture the similarity of fashion styles, a similar-style context retrieval model is proposed to find results similar in style to the query clothing and make the results more diverse.
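The similar-style context model can likewise be sketched with gensim's word2vec by treating garment IDs as words and outfits as sentences. The probability returned by predict_output_word is only a hedged stand-in for the probability-driven style score mentioned above, and all hyper-parameters and helper names are assumptions.

```python
# Sketch of the similar-style context model: each garment ID is treated as a "word"
# and each outfit as a "sentence", so a word2vec-style model learns which garments
# share a style context. All hyper-parameters and helper names are assumptions.
from gensim.models import Word2Vec

def train_style_context(outfits):
    """outfits: list of outfits, each a list of garment-ID strings (the 'sentences')."""
    return Word2Vec(sentences=outfits, vector_size=64, window=5,
                    min_count=1, sg=1, negative=5, epochs=30)

def similar_style_items(model, garment_id, topn=5):
    """Garments that tend to occur in the same outfits, i.e. share a style context."""
    return model.wv.most_similar(garment_id, topn=topn)

def style_score(model, query_item, candidate_item):
    """A probability-flavoured stand-in for the style similarity: the model's
    predicted probability that candidate_item appears in query_item's context."""
    probs = dict(model.predict_output_word([query_item], topn=len(model.wv)))
    return probs.get(candidate_item, 0.0)

# Example usage with hypothetical garment IDs:
# model = train_style_context([["top_12", "skirt_7", "bag_3"], ["top_12", "jeans_5"]])
# similar_style_items(model, "top_12")
```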
Keywords
