Image generation from scene graph with graph attention network

Lan Hong, Liu Qinyi (College of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China)

Abstract
Objective Current text-to-image generation models perform well only on image datasets whose images contain a single object; when an image involves multiple objects and relationships, the generated image becomes chaotic. An existing solution converts the text description into a scene-graph structure that better represents the relationships among objects in the scene and then generates the image from the scene graph. However, the images produced by existing scene-graph-to-image generation models are not sharp enough and lack object detail. To address this, we propose a scene-graph-to-image generation model based on a graph attention network that generates higher-quality images. Method The model consists of a graph attention network that extracts scene-graph features, an object layout network that composes the scene layout, a cascaded refinement network that converts the scene layout into the generated image, and a discriminator network that improves the quality of the generated image. The graph attention network passes output object feature vectors with stronger expressive power to the improved object layout network, which composes a scene layout closer to the ground-truth labels. In addition, we propose computing the image loss by feature matching, making the final generated image semantically more similar to the real image. Result Trained on the COCO-Stuff image dataset, which contains images with multiple objects, the model generates 64×64-pixel images. It can generate complex scene images containing multiple objects and relationships, and the generated images achieve an Inception Score of about 7.8, an improvement of 0.5 over the original scene-graph-to-image generation model. Conclusion The proposed graph-attention-based scene-graph-to-image generation model not only generates complex scene images containing multiple objects and relationships but also produces images of higher quality with clearer details.
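To make the role of the graph attention network concrete, the attention-weighted aggregation over scene-graph neighbors can be sketched as follows. This is a minimal, illustrative single-layer implementation in the spirit of a standard graph attention layer, not the exact network proposed in the paper; the function name and the learnable parameters `W` and `a` are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_attention_layer(H, A, W, a):
    """One simplified graph-attention aggregation step.

    H : (N, F)  object (node) feature matrix from the scene graph
    A : (N, N)  adjacency matrix of the scene graph (1 = edge)
    W : (F, F2) learnable projection (illustrative parameter)
    a : (2*F2,) learnable attention vector (illustrative parameter)
    Returns the (N, F2) updated object features.
    """
    Z = H @ W                                   # project node features
    N = Z.shape[0]
    # Pairwise attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
    logits = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = a @ np.concatenate([Z[i], Z[j]])
            logits[i, j] = s if s > 0 else 0.2 * s  # LeakyReLU, slope 0.2
    # Restrict attention to scene-graph neighbors (plus self-loops)
    mask = (A + np.eye(N)) > 0
    logits[~mask] = -1e9
    alpha = softmax(logits)                     # attention coefficients
    return alpha @ Z                            # weighted neighbor aggregation
```

Compared with a plain graph convolution, the learned coefficients `alpha` let each object weight its scene-graph neighbors unequally, which is what gives the output object vectors their stronger expressive power.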
Keywords
Image generation from scene graph with graph attention network

Lan Hong, Liu Qinyi (College of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China)

Abstract
Objective With the development of deep learning, the problem of image generation has achieved great progress. Text-to-image generation is an important research direction in deep-learning-based image generation, and researchers have proposed numerous models to implement it. However, a significant limitation remains: these models handle relationships poorly when generating images that involve multiple objects. An existing solution replaces the descriptive text with a scene-graph structure that more closely represents the scene relationships in the image and then uses the scene graph to generate the image. Scene graphs are a preferred structured representation between natural language and images, which facilitates the transfer of information between the objects in the graph. Although scene-graph-to-image generation solves the problem of generating images containing multiple objects and relationships, existing models ultimately produce images of low quality whose object details are unremarkable compared with real samples. A model with improved performance should be developed to generate high-quality images. Method We propose image generation from scene graphs with a graph attention network (GA-SG2IM), an improved scene-graph-to-image model that generates high-quality images containing multiple objects and relationships. The proposed model realizes image generation in three stages. First, a feature extraction network extracts features from the scene graph: the graph attention network introduces an attention mechanism into the original graph convolution network, giving the output object vectors stronger expressive power. The object vectors are then passed to the improved object layout network to obtain a reasonable and realistic scene layout.
Finally, the scene layout is passed to the cascaded refinement network to obtain the final output image. A discriminator network consisting of an object discriminator and an image discriminator is attached at the end to ensure that the generated image is sufficiently realistic. At the same time, we use feature matching as the image loss function to ensure that the final generated image and the real image are semantically similar, yielding high-quality images. Result We use the COCO-Stuff image dataset to train and validate the proposed model. The dataset includes more than 40 000 images of different scenes, each annotated with object bounding boxes and segmentation masks; these annotations can be used to synthesize the scene graphs given as input to the proposed model. We train the model to generate 64×64 images and compare them with those of other image generation models to demonstrate its feasibility. We also compare quantitative results, namely the Inception Score and the bounding-box intersection over union (IoU) of the generated images, against the SG2IM (image generation from scene graph) and StackGAN models. The final experimental results show that the proposed model achieves an Inception Score of 7.8, an improvement of 0.5 over the SG2IM model. Conclusion Qualitative experimental results show that the proposed model can generate complex scene images containing multiple objects and relationships and improves the quality of the generated images to a certain extent, making the final images clear and the object details evident. A machine that can generate high-quality images containing multiple objects and relationships autonomously models its input data and takes a step toward "intelligence".
Our next goal is to enable the proposed model to generate high-resolution images, such as photographic images, in real time, which will require substantial theoretical support and practical work.
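The feature-matching image loss described in the abstract can be sketched as below. This is a minimal illustration of the common GAN feature-matching formulation (an L1 distance between discriminator feature maps of real and generated images, averaged over layers); the function name, the choice of L1 distance, and the layer selection are assumptions, since the abstract does not specify the exact form.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between corresponding discriminator feature
    maps of a real image and a generated image.

    real_feats, fake_feats : lists of same-shaped arrays, one per
    chosen discriminator layer (illustrative layer selection).
    """
    assert len(real_feats) == len(fake_feats)
    loss = 0.0
    for fr, ff in zip(real_feats, fake_feats):
        loss += np.abs(fr - ff).mean()   # per-layer L1 distance
    return loss / len(real_feats)        # average over layers
```

Matching intermediate discriminator features, rather than only the final real/fake score, pushes the generated image toward the real image in the discriminator's learned semantic space, which is the intuition behind the claim that the generated and real images become semantically more similar.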
Keywords
