Coupled preserving projection hashing for cross-modal retrieval
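Min Kangling1,2, Zhang Guobin3, Wang Lei1,2, Li Danping4 (1. School of Electronic Engineering, Xidian University, Xi'an 710071; 2. Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240; 3. The 27th Research Institute, China Electronics Technology Group Corporation, Zhengzhou 450047; 4. School of Telecommunications Engineering, Xidian University, Xi'an 710071)
Abstract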
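Objective Hashing-based cross-modal retrieval methods have attracted wide attention because of their fast retrieval speed and low storage cost. However, most of these algorithms map data from different modalities directly into a common Hamming space, so they struggle to overcome the large differences in feature representation and feature dimensionality across modalities, and they can hardly preserve the structural information of the original data in the Hamming space at the same time. To address these problems, this paper proposes a coupled preserving projection hashing algorithm for cross-modal retrieval.
Method To handle the heterogeneity of cross-modal data, the data of each modality are first projected into their own subspace to narrow the modality “gap”, and a graph model is introduced into the subspace learning to preserve the structural consistency among the data. To build semantic associations between modalities, the subspace features are then mapped into the Hamming space to obtain consistent hash codes. Finally, a label constraint is introduced to improve the discriminability of the hash codes.
Result The method was compared with mainstream methods on three datasets. On the Wikipedia dataset, compared with the second-best algorithm, the mean average precision (mAP) on the image-retrieving-text task (I to T) and the text-retrieving-image task (T to I) improved by about 6% and 3%, respectively. On the MIRFlickr dataset, the gains over the second-best algorithm were about 2% and 5%, and on the Pascal Sentence dataset they were about 10% and 7%.
Conclusion The proposed method is applicable to mutual retrieval between two modalities. By introducing the coupled projection and graph model modules, it effectively improves the accuracy of cross-modal retrieval.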
Keywords
Structure-preserving hashing with coupled projections for cross-modal retrieval
Min Kangling1,2, Zhang Guobin3, Wang Lei1,2, Li Danping4 (1. School of Electronic Engineering, Xidian University, Xi'an 710071, China; 2. Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China; 3. The 27th Research Institute, China Electronics Technology Group Corporation, Zhengzhou 450047, China; 4. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China)
Abstract
Objective With the rapid development of multimedia technology, the scale of multimedia data has grown rapidly. People now routinely describe what they want to show with multimedia data such as text, images, and videos, so retrieving relevant results in one modality with a query from another modality is a natural goal. How to effectively analyze semantic correlation and measure the similarity between such data has therefore become a hot research topic. Because the representations of different modalities are heterogeneous, cross-modal retrieval remains a great challenge. Hashing-based methods have received great attention in cross-modal retrieval because of their fast retrieval speed and low storage consumption. To address the heterogeneity between modalities, most current supervised hashing algorithms directly map the data of different modalities into the Hamming space. However, these methods have the following limitations: 1) The data from each modality have different feature representations, and the dimensions of their feature spaces vary greatly. It is therefore difficult to obtain a consistent hash code by directly mapping the data from different modalities into the same Hamming space. 2) Although these hashing methods take label information into account, they ignore the structural information of the original data, which can yield hash codes that poorly represent the original structure of each modality. To solve these issues, this paper proposes a novel hashing algorithm for cross-modal retrieval called structure-preserving hashing with coupled projections (SPHCP).
Method Considering the heterogeneity between the cross-modal data, the algorithm first projects the data from different modalities into their respective subspaces to reduce the modal difference.
A local graph model is also designed in the subspace learning to maintain the structural consistency between the samples. Then, to build a semantic relationship between different modalities, the algorithm maps the subspace features to the Hamming space to obtain a consistent hash code. At the same time, a label constraint is exploited to improve the discriminative power of the obtained hash codes. Finally, the algorithm measures the similarity of data from different modalities in terms of the Hamming distance.
Result We compared our model with several state-of-the-art methods on three public datasets, namely, Wikipedia, MIRFlickr, and Pascal Sentence. The mean average precision (mAP) is used as the quantitative evaluation metric. We first tested our method on two benchmark datasets, Wikipedia and MIRFlickr. To evaluate the impact of hash code length on performance, we set the hash code length to 16, 32, 64, and 128 bits. The experimental results show that, for both the text-retrieving-image task and the image-retrieving-text task, the proposed method outperforms the existing methods at every code length. To further measure performance on a dataset with deep features, we tested the algorithm on the Pascal Sentence dataset. The results show that SPHCP also achieves a higher mAP on such a dataset. In general, cross-modal retrieval methods based on deep networks handle nonlinear features well, so their retrieval accuracy is expected to be higher than that of traditional methods, but they need much more computational power. As a “shallow” method, the proposed SPHCP algorithm is competitive with deep methods in terms of mAP. Therefore, as an interesting future direction, our framework could be used in conjunction with deep learning, i.e., using deep learning to extract the features of images and text offline and using the SPHCP algorithm for fast retrieval.
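The retrieval and evaluation steps described above (ranking by Hamming distance, then scoring with mAP) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, hash codes are plain lists of 0/1 bits, and a retrieved item counts as relevant when it has the same single class label as the query. A multi-label dataset would need a different relevance rule.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the bit positions where two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("hash codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query: Sequence[int],
                    database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted by ascending Hamming distance.

    Python's sort is stable, so ties keep their original database order.
    """
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


def average_precision(relevant: Sequence[bool]) -> float:
    """AP of one ranked list; relevant[k] flags the item at rank k + 1."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes: Sequence[Sequence[int]],
                           query_labels: Sequence[str],
                           db_codes: Sequence[Sequence[int]],
                           db_labels: Sequence[str]) -> float:
    """Mean of per-query AP, with relevance = same label (an assumption)."""
    aps = []
    for code, label in zip(query_codes, query_labels):
        order = rank_by_hamming(code, db_codes)
        aps.append(average_precision([db_labels[i] == label for i in order]))
    return sum(aps) / len(aps) if aps else 0.0
```

In a cross-modal setting, the query codes would come from one modality (for example, images) and the database codes from the other (for example, text); the ranking and scoring logic is the same in both directions.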
Furthermore, we analyzed the parameter sensitivity of the proposed algorithm. As the algorithm has seven parameters, a controlled-variable method is used for the evaluation. The experimental results show that the proposed algorithm is not sensitive to its parameters, which means that the training process does not require much optimization time, making it suitable for practical application.
Conclusion In this study, a novel method called SPHCP is proposed to solve the problems mentioned above. First, to address the “modal gap” between cross-modal data, a coupled-projection scheme is applied to gradually reduce the modal difference of multimedia data. In this way, a more consistent hash code can be obtained. Second, considering the structural information and semantic discrimination of the original data, the algorithm introduces a graph model into subspace learning, which maintains the intra-class and inter-class relationships of the samples. Finally, a label constraint is introduced to improve the discriminability of the hash code. The experiments on the benchmark datasets verify the effectiveness of the proposed algorithm. Specifically, compared with the second-best method, SPHCP achieves improvements of approximately 6% and 3% on the two retrieval tasks on Wikipedia, approximately 2% and 5% on MIRFlickr, and approximately 10% and 7% on Pascal Sentence. However, the proposed method requires a large amount of computing power when dealing with large-scale data, because SPHCP introduces a graph model to maintain the structural information between the data. Computing the structural information between every pair of samples leads to higher computational complexity. In future research, we will introduce nonlinear feature mapping into our SPHCP framework to improve its scalability when dealing with nonlinear feature data. Furthermore, SPHCP can be extended from a cross-modal retrieval algorithm to a multi-modal version.
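To make the cost of the graph model concrete, the sketch below builds a generic label-based affinity graph and evaluates a structure-preservation penalty on subspace features. The abstract does not specify SPHCP's graph construction, so everything here is an assumption for illustration: the function names are hypothetical, the affinity is 1 for two distinct samples of the same class and 0 otherwise, and only the intra-class relationship is modeled. The nested loops over all sample pairs show why such a term scales quadratically with the number of samples.

```python
from typing import List, Sequence


def label_affinity(labels: Sequence[int]) -> List[List[float]]:
    """Build an n x n affinity matrix from class labels.

    Assumed rule: W[i][j] = 1.0 when samples i and j are different samples
    with the same label, otherwise 0.0.
    """
    n = len(labels)
    return [[1.0 if i != j and labels[i] == labels[j] else 0.0
             for j in range(n)]
            for i in range(n)]


def graph_smoothness(features: Sequence[Sequence[float]],
                     affinity: Sequence[Sequence[float]]) -> float:
    """Return 0.5 * sum_ij W[i][j] * ||z_i - z_j||^2.

    A small value means that samples connected in the graph stay close in
    the subspace. The double loop visits all n^2 pairs.
    """
    total = 0.0
    n = len(features)
    for i in range(n):
        for j in range(n):
            weight = affinity[i][j]
            if weight != 0.0:
                squared = sum((a - b) ** 2
                              for a, b in zip(features[i], features[j]))
                total += weight * squared
    return 0.5 * total
```

For example, with labels `[0, 0, 1]` only the first two samples are connected, so the penalty depends only on the distance between their subspace features.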
Keywords
cross-modal retrieval; hashing; structure-preserving; graph model; coupled projections; subspace learning