Image captioning with cross-layer multi-model feature fusion and causal convolutional decoding
Abstract
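Objective The accuracy and plausibility of an image caption depend on two aspects of how the model processes information: how richly the visual module extracts feature information, and how well the language module handles sentences that describe complex scenes. However, existing image captioning models use only one encoder to extract image features, which easily loses feature information and prevents the model from fully understanding the semantics of the input image. Modeling sentences with an RNN (recurrent neural network) or LSTM (long short-term memory) tends to ignore the basic hierarchical structure of sentences and learns long word sequences poorly. To address these problems, an image captioning model based on cross-layer multi-model feature fusion and causal convolutional decoding is proposed. Method In the visual feature extraction module, a low-to-high cross-layer feature fusion structure is added to each single model so that semantic features and detail features complement each other, and multiple encoders are trained to extract features from the image, which play a supplementary role in fully describing and representing image semantics. In the language module, causal convolution is used to model the long word sequences that describe complex scenes, yielding a set of word features. An attention mechanism connects and matches the image features with the word features to learn the correlation between the text information and different image regions, and the prediction module finally applies the Softmax function to obtain the final prediction probability of each word. Result The model is validated on two datasets, MS COCO (Microsoft common objects in context) and Flickr30k, with different evaluation methods, and the experimental results show that the proposed model performs well. The BLEU (bilingual evaluation understudy)-1 score, which reflects the accuracy of generated words, reaches 72.1%, and the model outperforms other mainstream methods on several other metrics: its B-4 score exceeds that of the strong Hard-ATT ("Hard" attention) method by 6.0%, its B-1 and CIDEr (consensus-based image description evaluation) scores exceed those of the emb-g (embedding guidance) LSTM method by 5.1% and 13.3%, respectively, and its B-1 score is 0.3% higher than that of ConvCap (convolutional captioning), which also uses a CNN (convolutional neural network) + CNN strategy. Conclusion The proposed model effectively extracts and preserves the semantic information in images with complex backgrounds, can process long word sequences, and describes image content more accurately and with richer information.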
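The causal convolution used in the language module can be illustrated with a minimal sketch. The abstract does not specify kernel sizes, gating, or layer counts, so the function name `causal_conv1d`, the single-layer structure, and the plain-list representation are assumptions made for illustration only, not the authors' implementation. The point it shows is the defining property: the output at word position t depends only on words at positions up to t.

```python
from typing import List

Vector = List[float]


def causal_conv1d(seq: List[Vector], kernel: List[List[Vector]]) -> List[Vector]:
    """Causal 1-D convolution over a sequence of word vectors (hypothetical helper).

    seq:    T vectors, each of size d_in.
    kernel: k taps; kernel[j] is a d_out x d_in matrix applied to the
            input at position t - (k - 1) + j, so the last tap sees x_t.
    """
    k = len(kernel)
    d_in = len(seq[0])
    d_out = len(kernel[0])
    # Left-pad with k - 1 zero vectors so that no future word is ever read.
    padded = [[0.0] * d_in for _ in range(k - 1)] + seq
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]  # original positions t - k + 1 .. t
        y = [0.0] * d_out
        for tap, x in zip(kernel, window):
            for o in range(d_out):
                y[o] += sum(w * xi for w, xi in zip(tap[o], x))
        out.append(y)
    return out


# Toy usage: 1-dimensional "embeddings" and a width-2 kernel that adds
# the previous and the current input.
words = [[1.0], [2.0], [3.0]]
kernel = [[[1.0]], [[1.0]]]
features = causal_conv1d(words, kernel)
```

Because padding is applied only on the left, changing a later word leaves all earlier outputs unchanged, which is what allows such a decoder to be trained on whole sentences in parallel while still predicting words one position at a time.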
Keywords
Image caption based on causal convolutional decoding with cross-layer multi-model feature fusion
Luo Huilan, Yue Liangliang (Jiangxi University of Science and Technology, Ganzhou 341000, China)
Abstract
Objective The quality of image captioning depends on the richness of the extracted image features, but existing methods use only one encoder for feature extraction and are therefore unable to fully learn the semantics of images, which may lead to inaccurate captions for images with complicated content. Meanwhile, the ability of the language module to process sentences with complex contexts plays an important role in generating accurate and reasonable captions. However, the current mainstream methods that use RNN (recurrent neural network) or LSTM (long short-term memory) tend to ignore the basic hierarchical structure of sentences and therefore do not work well in expressing long sequences of words. To address these issues, an image captioning model based on cross-layer multi-model feature fusion and causal convolutional decoding (CMFF/CD) is proposed in this paper. Method In the visual feature extraction stage, because feature information is lost as image features propagate through the convolutional layers, a cross-layer feature fusion structure from the low to high levels is added so that the semantic and detail features complement each other. Afterward, multiple encoders are trained to conduct feature extraction on the input image. When the information contained in an image is highly complex, these encoders play a supplementary role in fully describing and representing image semantics. Each image in the training dataset corresponds to manually labeled sentences that are used to train the language decoder. When the sentences are longer and more complex, the learning ability of the language model is reduced, which makes it difficult to learn the relationships among objects. Causal convolution can model long sequences of words to express complex contexts and is therefore used in the proposed language module to obtain the word features. An attention mechanism is then used to match the image features with the word features.
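The matching of word features with image features can be sketched as follows. The abstract does not state how the attention scores are computed, so dot-product scoring, the function names `softmax` and `attend`, and the toy feature values are all assumptions for illustration rather than the model's actual design.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_feat: List[float],
           region_feats: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Attend from one word feature over image region features.

    Scoring is a plain dot product (an assumption). Returns the attention
    weights over regions and the weighted sum of the region features.
    """
    scores = [sum(w * r for w, r in zip(word_feat, region))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    context = [sum(a * region[i] for a, region in zip(weights, region_feats))
               for i in range(dim)]
    return weights, context


# Toy usage: one word feature and three image regions (made-up values).
weights, context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
```

In this toy case the first region is the one most aligned with the word feature, so it receives the largest weight, while the other two regions receive equal, smaller weights.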
Each word feature corresponds to an object feature in the image. The model not only accurately describes the image content but also learns the correlation between the text information and different regions of an image. The prediction module then uses the Softmax function to obtain the prediction probability of each word. Result The model was validated on the Microsoft common objects in context (MS COCO) and Flickr30k datasets by using various evaluation methods. The experimental results demonstrate that the proposed model achieves competitive performance, especially in describing complex scene images. Compared with other mainstream methods, the proposed model not only specifies the scene information of an image (e.g., restaurants) but also identifies specific objects in the scene and accurately describes their categories. Compared with attention fully convolutional network (ATT-FCN), spatial and channel-wise attention (SCA)-convolutional neural network (CNN), and part-of-speech (POS), the proposed model generates richer image information in description sentences and handles long word sequences better. This model can describe "toothbrush, sink/tunnel/audience", "bed trailer, bus/mother/cruise ship", and other objects in an image, whereas other models based on the CNN + LSTM architecture are unable to do so. Although the ConvCap model, which also uses the CNN + CNN architecture, can describe multiple objects in an image and assign them some attribute descriptions, the CMFF/CD model provides more accurate and detailed descriptions, such as "bread, peppers/curtain, blow dryer". In addition, while both models can describe "computer", the "desktop computer" description of the proposed model is more accurate than the "black computer" description derived by the ConvCap model. Meanwhile, the sentence structure derived by the proposed model is very similar to human expression.
In terms of the quality of the generated sentences, the proposed model reaches 72.1% on the bilingual evaluation understudy (BLEU)-1 metric, which reflects the accuracy of word generation. This model also obtains a B-4 score 6.0% higher than that of the Hard-ATT ("Hard" attention) method, thereby highlighting its excellent ability in matching local image features with word vectors. The proposed model can also fully utilize local information to express the content in detail. This model also outperforms the emb-gLSTM method in terms of B-1 and CIDEr (consensus-based image description evaluation) by 5.1% and 13.3%, respectively, and the ConvCap method, which also uses the CNN + CNN strategy, in terms of B-1 by 0.3%. Conclusion The proposed captioning model can effectively extract and preserve the semantic information in complex background images and process long sequences of words. In expressing the hierarchical relationship among complex background information, the proposed model effectively uses convolutional neural networks (i.e., causal convolution) to process text information. The experimental results show that the proposed model describes image content more accurately and expresses richer information.
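To make the B-1 figures above concrete, the following is a minimal sketch of a sentence-level BLEU-1 score: clipped unigram precision multiplied by a brevity penalty. Benchmark results such as those reported here are normally computed at corpus level with standard evaluation tooling, so this simplified function, its name `bleu1`, and the example sentences are illustrative assumptions only.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: List[str], references: List[List[str]]) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    # Clip limit for each word: its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for word, n in Counter(ref).items():
            max_ref[word] = max(max_ref[word], n)
    clipped = sum(min(n, max_ref[word]) for word, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty uses the reference length closest to the candidate
    # length (the shorter one on ties).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c >= r else math.exp(1.0 - r / c)
    return bp * precision


# Toy usage: repeating a common word is not rewarded, because counts are
# clipped by the references.
refs = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
score = bleu1("the the the the the the the".split(), refs)
```

Here only two of the seven candidate words are credited, since "the" appears at most twice in any single reference. A candidate that is shorter than its closest reference is additionally scaled down by the brevity penalty.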
Keywords
image caption; cross-layer feature fusion; convolution decoding; causal convolution; attention mechanism