Survey on deep learning based cross-modal retrieval

Yin Qiyue, Huang Yan, Zhang Junge, Wu Shu, Wang Liang (Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)

Abstract
Over the last decade, different types of media data, such as texts, images, and videos, have grown rapidly on the internet. Different types of data are often used to describe the same events or topics; for example, a web page usually contains not only a textual description but also images or videos illustrating the common content. Such data are referred to as multi-modal data, and they enable many applications, e.g., multi-modal retrieval, hot topic detection, and personalized recommendation. Nowadays, mobile devices and emerging social websites (e.g., Facebook, Flickr, YouTube, and Twitter) are used by almost everyone, and a pressing demand for cross-modal data retrieval has emerged. Accordingly, cross-modal retrieval, in which one type of data serves as the query to retrieve relevant data of another type, has attracted considerable attention; for example, a user can use a text to retrieve relevant pictures and/or videos. Because the query and its retrieved results can have different modalities, measuring the content similarity between different modalities of data, i.e., reducing the heterogeneity gap, remains a challenge. With the rapid development of deep learning techniques, various deep cross-modal retrieval approaches have been proposed to alleviate this problem, and promising performance has been obtained. We aim to review representative methods for deep learning based cross-modal retrieval.

We first classify these approaches into three main groups according to the cross-modal information provided: 1) co-occurrence information, 2) pairwise information, and 3) semantic information. Co-occurrence information based methods use only co-occurrence information to learn common representations across multi-modal data, where co-occurrence means that if different modalities of data co-exist in a multi-modal document, they are assumed to share the same semantics. Pairwise information based methods use similar and dissimilar pairs to learn the common representations; a similarity matrix covering all modalities is usually provided, indicating whether two items from different modalities belong to the same category. Semantic information based methods use class label information to learn common representations, where a multi-modal example can carry one or more labels obtained through extensive manual annotation. Usually, co-occurrence information is also available in pairwise and semantic information based approaches, and pairwise information can be derived when semantic information is provided, although these relationships do not necessarily hold. In general, the three groups provide increasing amounts of cross-modal information, and the more information is available for learning, the better the retrieval performance tends to be.

In each category, various techniques can be utilized and combined to fully exploit the provided cross-modal information. We roughly categorize these techniques into seven main classes: 1) canonical correlation analysis, 2) correspondence preserving, 3) metric learning, 4) likelihood analysis, 5) learning to rank, 6) semantic prediction, and 7) adversarial learning. Canonical correlation analysis methods seek linear combinations of two vectors of random variables with the objective of maximizing their correlation; when combined with deep learning, the linear projections are replaced with deep neural networks. Correspondence preserving methods aim to preserve the co-existence relationship of different modalities by minimizing their distances in the learned embedding space. Usually, the multi-modal correspondence is formulated as regularizers or loss functions that enforce a pairwise constraint when learning multi-modal common representations.
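To make the first two objectives concrete, here is a minimal sketch in our own notation (the symbols below are illustrative, not taken from any particular surveyed paper). Classical canonical correlation analysis finds projections $w_x$ and $w_y$ that maximize the correlation of the two projected views,

\[ \max_{w_x, w_y} \frac{w_x^{\top} \Sigma_{xy} w_y}{\sqrt{w_x^{\top} \Sigma_{xx} w_x}\,\sqrt{w_y^{\top} \Sigma_{yy} w_y}}, \]

where $\Sigma_{xy}$, $\Sigma_{xx}$, and $\Sigma_{yy}$ denote the cross- and within-modality covariance matrices; the deep variant instead maximizes $\operatorname{corr}(f(X;\theta_f), g(Y;\theta_g))$ over the parameters of two modality-specific networks $f$ and $g$. A typical correspondence preserving loss, in the same spirit, simply pulls each co-occurring pair $(x_i, y_i)$ together in the common space:

\[ \min_{f, g} \sum_{i} \lVert f(x_i) - g(y_i) \rVert_2^2 . \]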
Metric learning approaches seek a distance function for measuring multi-modal similarities, with the objective of pulling similar pairs from different modalities closer and pushing dissimilar pairs apart (a minimal sketch of a typical loss is given at the end of this abstract). Compared with correspondence preserving and canonical correlation analysis methods, similar and dissimilar pairs are provided as additional constraints when learning the common representations. Likelihood analysis methods, based on Bayesian analysis, are generative approaches whose objective is to maximize the likelihood of the observed multi-modal relationships, e.g., similarities. Conventionally, a maximum likelihood estimation objective is derived to maximize the posterior probability of the multi-modal observations. Learning to rank approaches construct a ranking model over the common representations with the objective of maintaining the order of multi-modal similarities. Compared with metric learning methods, explicit ranking losses are usually developed to optimize ranking-based similarity. Semantic prediction methods resemble traditional classification models, with the objective of accurately predicting the semantic labels of multi-modal data or of their relationships. With such high-level semantics exploited, the intra-modal semantic structure can be effectively reflected in the learned multi-modal common representations. Adversarial learning approaches refer to methods using generative adversarial networks, with the objective of learning common representations from which the modality source cannot be inferred. Usually, a generative model and a discriminative model are carefully designed to form a min-max game that yields statistically indistinguishable common representations (see the sketch after this abstract).

We also introduce several multi-modal datasets used in the community, i.e., the Wiki image-text dataset, the INRIA-Websearch dataset, the Flickr30K dataset, the Microsoft common objects in context (MS COCO) dataset, the real-world web image dataset from the National University of Singapore (NUS-WIDE), the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) dataset, and the XMedia dataset.

Finally, we discuss open problems and future directions. 1) Some researchers have put forward transferable/extendable/zero-shot cross-modal retrieval, in which the multi-modal data in the source domain and the target domain can have different semantic annotation categories. 2) Effective cross-modal benchmark datasets that contain multiple modalities and are large enough to validate complex algorithms, and thus to promote cross-modal retrieval on massive data, are still limited. 3) Labeling all cross-modal data and providing each sample with accurate annotations is impractical; exploiting such limited and noisy multi-modal data for cross-modal retrieval will therefore be an important research direction. 4) Researchers have designed increasingly complex algorithms to improve performance, but the requirements of retrieval efficiency are then difficult to satisfy; designing efficient yet high-performance cross-modal retrieval algorithms is thus a crucial direction. 5) Embedding whole samples of different modalities into a common representation space is difficult; extracting fragment-level representations for different modalities and developing more sophisticated fragment-level relationship modeling will be among the future research directions.
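Two further sketches, as referenced above, again in illustrative notation of our own rather than that of any surveyed method. A hinge-based triplet ranking loss of the kind used in metric learning and learning to rank can be written, for a batch of embedding rows, as follows (the function and variable names are hypothetical):

    import numpy as np

    def triplet_ranking_loss(img, txt_pos, txt_neg, margin=0.2):
        """Pull each matching image-text pair together and push the
        non-matching text at least `margin` farther away (illustrative sketch)."""
        d_pos = np.linalg.norm(img - txt_pos, axis=1)  # distances of similar pairs
        d_neg = np.linalg.norm(img - txt_neg, axis=1)  # distances of dissimilar pairs
        return np.maximum(0.0, margin + d_pos - d_neg).mean()

The adversarial objective can likewise be summarized as a min-max game in which a discriminator $D$ tries to tell image embeddings from text embeddings, while the encoders $f$ and $g$ try to make them indistinguishable:

\[ \min_{f, g} \max_{D} \; \mathbb{E}_{x}\big[\log D(f(x))\big] + \mathbb{E}_{y}\big[\log\big(1 - D(g(y))\big)\big] . \]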