Cross-lingual image captioning incorporating semantic matching and language evaluation

Zhang Jing1,2,3, Guo Dan1,2,3, Song Peipei1,2,3, Li Kun1,2,3, Wang Meng1,2,3 (1. School of Computer and Information Engineering, Hefei University of Technology, Hefei 230601, China; 2. Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 230601, China; 3. Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology), Hefei 230601, China)

Abstract
Objective Because paired data between images and the target language domain are lacking, existing cross-lingual captioning methods translate a pivot (source) language into the target language. Semantic noise introduced during this translation makes the generated sentences disfluent and weakly related to the visual content of the image. To address this, we propose a cross-lingual image captioning model that incorporates semantic matching and language evaluation. Method First, we adopt an encoder-decoder image captioning framework as the backbone network. Second, to exploit the semantic knowledge contained in both the image and its pivot-language description, we build a source-domain semantic matching module; to learn the linguistic conventions of the target language domain, we also build a target-language evaluation module. These two modules impose semantic matching constraints and language guidance on the captioning model: 1) the image & pivot-language semantic matching module maps the image, the pivot-language description, and the target-language description into a common embedding space to measure the semantic consistency of their modality-specific feature representations; 2) the target-language evaluation module scores the generated sentences according to the style of the target language. Result For the cross-lingual English image captioning task, we evaluate on the MS COCO (Microsoft common objects in context) dataset. Compared with the best-performing existing method, our model improves BLEU (bilingual evaluation understudy)-2, BLEU-3, BLEU-4, and METEOR (metric for evaluation of translation with explicit ordering) by 1.4%, 1.0%, 0.7%, and 1.3%, respectively. For the cross-lingual Chinese image captioning task, we evaluate on the AIC-ICC (image Chinese captioning from artificial intelligence challenge) dataset. Compared with the best-performing existing method, our model improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and CIDEr (consensus-based image description evaluation) by 5.7%, 2.0%, 1.6%, 1.3%, 1.2%, and 3.4%, respectively. Conclusion The image & pivot-language semantic matching module guides the model to learn richer semantic knowledge, and the target-language evaluation module constrains the model to generate more fluent sentences. The proposed model is well suited to cross-lingual image caption generation.
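To make the common-embedding-space matching described above concrete, the following is a minimal sketch of the idea rather than the authors' implementation. It assumes PyTorch, treats the image and sentence feature dimensions as given (the defaults are illustrative), and uses separate per-modality linear projections (whether the paper shares projection weights across languages is an assumption here). The image, the pivot-language caption, and the generated target sentence are projected into one space and scored by cosine similarity:

    # Minimal sketch of the semantic matching module (illustrative, not the
    # authors' code). Assumes PyTorch; img_feat / pivot_feat / target_feat are
    # pre-extracted CNN image features and RNN sentence features.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticMatching(nn.Module):
        def __init__(self, img_dim=2048, txt_dim=1024, embed_dim=512):
            super().__init__()
            # Per-modality projections into the common embedding space
            # (separate weights for each language are an assumption).
            self.img_proj = nn.Linear(img_dim, embed_dim)
            self.pivot_proj = nn.Linear(txt_dim, embed_dim)
            self.target_proj = nn.Linear(txt_dim, embed_dim)

        def forward(self, img_feat, pivot_feat, target_feat):
            # L2-normalize so the dot product equals cosine similarity
            v = F.normalize(self.img_proj(img_feat), dim=-1)
            p = F.normalize(self.pivot_proj(pivot_feat), dim=-1)
            t = F.normalize(self.target_proj(target_feat), dim=-1)
            # Semantic consistency of the generated target sentence with
            # both the image and the pivot-language caption.
            sim_img = (v * t).sum(dim=-1)
            sim_pivot = (p * t).sum(dim=-1)
            return 0.5 * (sim_img + sim_pivot)  # higher = more consistent

An embedding network of this kind would typically be trained with a ranking or contrastive objective on source-domain image-caption pairs before being used as a fixed consistency scorer.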
Keywords
Cross-lingual image captioning based on semantic matching and language evaluation

Zhang Jing1,2,3, Guo Dan1,2,3, Song Peipei1,2,3, Li Kun1,2,3, Wang Meng1,2,3(1.School of Computer and Information Engineering, Hefei University of Technology, Hefei 230601, China;2.Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 230601, China;3.Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology), Hefei 230601, China)

Abstract
Objective With the development of deep learning, image captioning has achieved great success. Image captioning can be applied to infant education, web search, and human-computer interaction, and it can also help visually impaired users access visual information. However, most image captioning work targets English, and image captioning should ultimately serve speakers of other languages as well. The main challenge of cross-lingual image captioning is the lack of paired image-caption data in the target language: collecting a large-scale image-caption dataset for every target language is impractical. Thanks to existing large-scale English captioning datasets and translation models, using a pivot language (e.g., English) to bridge the image and the target language (e.g., Chinese) is currently the main backbone framework for cross-lingual image captioning. However, such language-pivoted approaches suffer from disfluency and weak semantic relevance to the image. We therefore propose a cross-lingual image captioning model based on semantic matching and language evaluation. Method First, our model is built on a standard encoder-decoder framework, which extracts convolutional neural network (CNN)-based image features and generates the description with a recurrent neural network. The pivot-language (source-language) descriptions are translated into target-language sentences via a translation API, and these sentences are treated as pseudo caption labels of the images. Our model is initialized with these pseudo-labels. However, the captions generated by the initialized model tend to consist of high-frequency vocabulary, inherit the language style of the pseudo-labels, and relate poorly to the image content. Notably, the human-written pivot-language description correctly describes the image content and carries semantics consistent with the image. Therefore, considering the semantic guidance of the image content and the pivot language, a semantic matching module is proposed based on the source corpus. Moreover, the language style of the generated captions differs greatly from human-written target-language text. To learn the language style of the target language, a language evaluation module guided by the target-language corpus is proposed. These two modules impose semantic matching and language style constraints on the optimization of the proposed captioning model. The methodological contributions are as follows. 1) The semantic matching module is an embedding network over source-domain images and language labels. To measure the semantic matching among the image, the pivot-language description, and the generated sentence, these multimodal data are mapped into a common embedding space where their semantic relevance is computed. This strengthens the semantic link between the generated sentence and the visual content of the image. 2) The language evaluation module, built on the target-domain corpus, encourages the style of the generated sentences to resemble the target-language style. Under the joint rewards of semantic matching and language evaluation, applied in a reinforcement learning mode (sketched in code after this abstract), our model is optimized to generate sentences that are better related to the image. Result To verify the effectiveness of the proposed model, we conduct experiments on two sub-tasks.
1) The cross-lingual English image captioning task is evaluated on the MS COCO (Microsoft common objects in context) image-English dataset, with the model trained on the AIC-ICC (image Chinese captioning from artificial intelligence challenge) image-Chinese dataset and the MS COCO English corpus. Compared with the state-of-the-art method, our scores on bilingual evaluation understudy (BLEU)-2, BLEU-3, BLEU-4, and metric for evaluation of translation with explicit ordering (METEOR) improve by 1.4%, 1.0%, 0.7%, and 1.3%, respectively. 2) The cross-lingual Chinese image captioning task is evaluated on the AIC-ICC image-Chinese dataset, with the model trained on the MS COCO image-English dataset and the AIC-ICC Chinese corpus. Compared with the state-of-the-art method, the scores on BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and consensus-based image description evaluation (CIDEr) improve by 5.7%, 2.0%, 1.6%, 1.3%, 1.2%, and 3.4%, respectively. Conclusion The semantic matching module guides the model to learn the semantics shared by the image and the pivot-language description. The language evaluation module learns the data distribution and language style of the target corpus. The semantic and language rewards are both beneficial for cross-lingual image captioning: they not only improve the semantic relevance of the generated sentences but also further improve their fluency.
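As a complement, here is a hedged sketch of how the two rewards could drive a self-critical policy-gradient update of the kind the Method section describes. It reuses the SemanticMatching module from the earlier sketch as the semantic scorer; model.sample() and language_reward() are hypothetical interfaces standing in for the caption decoder and the target-language evaluation module, and the weight alpha is an illustrative choice, not the paper's setting:

    # Sketch of joint-reward reinforcement learning (illustrative). Assumes
    # PyTorch. model.sample() is a hypothetical interface that returns a
    # sentence feature for the decoded caption together with its per-token
    # log-probabilities of shape (batch, seq_len).
    import torch

    def rl_step(model, matcher, language_reward, optimizer,
                img_feat, pivot_feat, alpha=0.5):
        # Sample a caption under the current policy, keeping log-probs
        sampled_feat, log_probs = model.sample(img_feat)
        # Greedy decoding provides the self-critical baseline
        with torch.no_grad():
            greedy_feat, _ = model.sample(img_feat, greedy=True)
            r_base = (alpha * matcher(img_feat, pivot_feat, greedy_feat)
                      + (1 - alpha) * language_reward(greedy_feat))
        r = (alpha * matcher(img_feat, pivot_feat, sampled_feat)
             + (1 - alpha) * language_reward(sampled_feat))
        # REINFORCE with baseline: push up sentences whose joint reward
        # beats the greedy baseline.
        advantage = (r - r_base).detach()
        loss = -(advantage * log_probs.sum(dim=-1)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Using the greedy caption as the baseline (self-critical training) keeps the reward signal centered, so only sentences that outperform the model's own deterministic output receive a positive update.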
Keywords
