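From image to language: image captioning and description
Tan Yunlan1,2,3, Tang Pengjie1,2, Zhang Li1, Luo Yupan3 (1. School of Electronics and Information Engineering, Jinggangshan University, Ji'an 343009; 2. Jiangxi Engineering Laboratory of IoT Technologies for Crop Growth, Ji'an 343009; 3. Network Information Center, Jinggangshan University, Ji'an 343009)
Abstract
The task of image captioning and description is to have a computer automatically translate an image into natural language. The research has broad application prospects in areas such as visual assistance for humans and the development of intelligent human-computer environments, and it also supports research on image retrieval, high-level visual semantic reasoning, and personalized description. Image data are highly nonlinear and complex, whereas human natural language is abstract and logically rigorous, so having a computer automatically abstract and summarize image content is very challenging. This paper describes the task of simple image captioning and description, analyzes methods that generate simple image descriptions from handcrafted features, and reviews deep feature-based methods, including those based on global visual features, on visual feature selection and optimization, and on optimization strategies. For the fine-grained image description task, the main current models and methods for image "dense captioning" and structured description are analyzed. In addition, image description methods that incorporate sentiment and personalized expression are analyzed. Throughout the analysis, the shortcomings of current image captioning and description methods are pointed out, and possible research trends and solutions are proposed. Commonly used datasets in the field, such as MS COCO2014 (Microsoft common objects in context) and Flickr30K, are introduced in detail, and the performance of representative models for simple image description, dense captioning and paragraph description, and sentimental image description on these datasets is compared. Owing to the complexity of visual data and the abstractness of natural language, many problems remain to be solved, especially for image description that incorporates sentiment and personalized expression, in feature extraction and representation, the selection and embedding of semantic words, dataset construction, and description evaluation.
Keywords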
From image to language: image captioning and description
Tan Yunlan1,2,3, Tang Pengjie1,2, Zhang Li1, Luo Yupan3 (1. School of Electronics and Information Engineering, Jinggangshan University, Ji'an 343009, China; 2. Jiangxi Engineering Laboratory of IoT Technologies for Crop Growth, Ji'an 343009, China; 3. Network Information Center, Jinggangshan University, Ji'an 343009, China)
Abstract
Image captioning and description belong to high-level visual understanding. The task translates an image into natural language with suitable words, appropriate sentence patterns, and correct grammar. It has wide application prospects in early education, aids for the visually impaired, automatic explanation, automatic reminding, the development of intelligent interactive environments, and even the design of intelligent robots. It also supports research on image retrieval, object detection, visual semantic reasoning, and personalized description. The task has attracted the attention of many researchers, and a large number of effective models have been proposed. However, it remains challenging because a model has to bridge visual information and natural language and close the semantic gap between data of different modalities. In this work, the development timeline, popular frameworks and models, frequently used datasets, and corresponding performance of image captioning and description are surveyed comprehensively. Additionally, the remaining questions and limitations of current works are analyzed in depth. Overall, this study covers four parts: 1) simple image captioning and description (generally one sentence is generated for an image), including handcrafted feature-based methods and deep feature-based approaches; 2) image dense captioning (generally multiple but relatively independent sentences are generated) and refined paragraph description (generally a paragraph with a certain structure and logic is generated); 3) personalized and sentimental image captioning and description (generally sentences with a personalized style and sentimental words are generated); and 4) the corresponding evaluation datasets, metrics, and performance of popular models.
For the first part, the research history of image captioning and description is introduced first, including the template-based framework and the visual semantic retrieval-based framework, both built on handcrafted visual features. Classical and significant works such as the semantic space sharing model and the visual semantic component reorganization model are described in detail. Then, the currently popular works based on deep learning techniques are sorted out and elaborated. According to how visual information is used, deep feature-based models for image captioning and description can be classified into three main categories: 1) global visual feature-based models, 2) visual feature selection and optimization-based models, and 3) optimization strategy-oriented models. For each category, the popular works are analyzed and discussed, including the proposed models, their advantages, and their possible problems. Models based on selected or optimized visual features, such as visual attention regions, attributes, and concepts used as prior knowledge, are usually more intuitive and perform better, especially when advanced optimization strategies such as reinforcement learning are employed. Their generated sentences frequently contain more accurate words and richer semantics, although a few methods based on global visual features perform equally well. Besides the models for simple image captioning and description, popular works on dense captioning and refined description of images are presented in the second part. Models for dense captioning generate more sentences per image and offer more detailed descriptions. However, the semantic relevance among different visual objects, scenes, and actions is usually ignored and not embedded into the sentences, although a few approaches exploit possible visual relations to predict more accurate words.
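The idea of selecting visual features through attention regions, mentioned above, can be illustrated with a minimal soft-attention sketch. This is not the formulation of any particular surveyed model: the function name `soft_attention`, the plain dot-product scoring, and the toy feature vectors are all assumptions made for illustration, whereas published models typically learn the scoring function.

```python
import math

def soft_attention(region_feats, query):
    """Weight image regions by their relevance to a query vector.

    region_feats: list of equal-length feature vectors, one per region.
    query: vector of the same length (for example, a decoder state).
    Dot-product scoring is an illustrative simplification.
    """
    # Relevance score of each region: dot product with the query
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    # Softmax (shifted by the max for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of region features
    dim = len(region_feats[0])
    context = [
        sum(w * feat[j] for w, feat in zip(weights, region_feats))
        for j in range(dim)
    ]
    return context, weights

# Toy example: three regions with 2-dimensional features
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
context, weights = soft_attention(regions, query)
```

In this toy example the first and third regions align with the query and receive equal, larger weights than the second, so the context vector is dominated by their features.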
With regard to refined paragraph description of images, a hierarchical architecture with multiple recurrent neural network layers is the most commonly employed basic framework, and hierarchical attention mechanisms, visual attributes, and reinforcement learning strategies are also introduced into the related models to further improve performance. However, the semantic relevance and logic among different visual objects remain to be further explored and represented, and the coherence and logic of the generated paragraphs need further refinement. Additionally, when people describe an image, they usually draw on personal experience, so their sentences often contain personalized and sentimental information. Therefore, a few significant works on personalized image captioning and sentimental description are also introduced and discussed in this paper. In particular, the discovery, representation, and embedding of personalized information and sentiment in the models are surveyed and analyzed in depth. Moreover, open problems of the task, including the granularity and intensity of sentiments and the evaluation metrics for personalized and sentimental description, are worthy of further research. In addition to classical frameworks and popular models, the related public evaluation datasets and metrics are summarized. First, the datasets for simple image captioning and description, including Microsoft common objects in context (MS COCO2014), Flickr30K, and Flickr8K, are briefly introduced together with the performance of a few popular models on them. Afterward, the datasets for image dense captioning and paragraph description, including Visual Genome and VG-P (Paragraph), are described together with the performance of certain current works on them.
Next, the datasets for image description with personalized and sentimental expression, including SentiCap and FlickrStyle10K, are briefly introduced, and the performance of the main models is reported and discussed. Additionally, the frequently used evaluation methods, including traditional metrics and specially targeted metrics, are described and compared. In conclusion, breakthroughs have been made in image captioning and description in recent years, and the quality of generated sentences has greatly improved. However, more effort is still needed to generate more coherent and accurate sentences with richer semantics for images. Possible trends and solutions for image captioning and description are put forward in this study. To bring the task closer to practical application, the semantic gap between visual data and natural language should be narrowed by generating structured paragraphs with sentiment and logical semantics for images. However, several problems remain to be addressed, including the refinement and use of visual features, the mining and embedding of sentiment and logic, the collection of corresponding training datasets, and the design of metrics for evaluating personalized, sentimental, and paragraph descriptions.
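As an illustration of what a traditional n-gram-based evaluation metric computes, the following minimal sketch implements clipped (modified) n-gram precision, the core quantity of BLEU. BLEU is assumed here as a representative traditional metric; the abstract does not name the metrics it compares. The function names and example sentences are hypothetical, and the sketch omits BLEU's brevity penalty and its geometric mean over n-gram orders.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference captions."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip candidate counts so repeated words cannot inflate the score
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Hypothetical generated caption and two reference captions
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across the lawn".split(),
]
p1 = modified_precision(candidate, references, 1)  # unigram precision
p2 = modified_precision(candidate, references, 2)  # bigram precision
```

Clipping matters because a degenerate caption that repeats a common word would otherwise score perfectly on unigram precision; with clipping, each word is credited at most as often as it appears in a single reference.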
Keywords
image captioning; deep feature; visual description; paragraph generation; image sentiment; logical semantics