An arbitrary style transfer strategy based on the Transformer

Sun Meiting, Dai Longquan, Tang Jinhui (School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China)

Abstract
Objective Arbitrary style transfer is an important branch of image processing. Convolutional neural networks (CNNs), its most common architecture, can extract and separate content and style information, but the receptive field of the convolution operation restricts them to capturing local prior knowledge about the image. The Transformer network from natural language processing can break this distance limit and capture long-range dependencies, modeling global information better; however, because it must learn the correlations among all elements, its greater expressive power comes with higher computational cost. Given the similarity between the style transfer process and the sentence translation process, we propose a hybrid network model that combines the advantages of CNNs and the Transformer while suppressing their shortcomings.

Method First, a CNN extracts high-level image features while reducing the image size. The extracted features are then fed into a Transformer, which computes the correlations between the content features and the style features and replaces each content feature with a weighted sum of style features, realizing the style transformation. Finally, a CNN maps the processed features back to the image domain to generate the artistic image.

Result We compare qualitatively and quantitatively with five state-of-the-art arbitrary style transfer methods. Qualitatively, a user study comparing the stylization effect of each method's output shows that users prefer the renderings produced by our network. Quantitatively, a comparison of stylization speed shows that our network ranks third, which is within an acceptable range. In addition, we compare our network with existing Transformer-based arbitrary style transfer methods to highlight the differences between them; an ablation study of the discriminative network shows that introducing it effectively improves the smoothness and cleanness of the images; finally, we apply our network to a variety of style transfer tasks, demonstrating its flexibility.

Conclusion The proposed hybrid network model combines the advantages of CNNs and the Transformer and introduces a discriminative network, making the generated stylized images more realistic and vivid.
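To make the Method paragraph concrete, the following is a minimal PyTorch sketch of the attention step described above, in which content features serve as queries and style features serve as keys and values, so each content position is replaced by a weighted sum of style features. The module and variable names (StyleCrossAttention, dim, num_heads) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class StyleCrossAttention(nn.Module):
        """Content tokens query style tokens; each content position is then
        replaced by an attention-weighted sum of style features."""

        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
            # content: (B, N_c, C) tokens flattened from the CNN content features
            # style:   (B, N_s, C) tokens flattened from the CNN style features
            stylized, _ = self.attn(query=content, key=style, value=style)
            # A residual connection keeps the original content structure intact.
            return self.norm(content + stylized)

    # Usage: flatten a (B, C, H, W) CNN feature map into (B, H*W, C) tokens first,
    # e.g. tokens = features.flatten(2).transpose(1, 2)

Adding the residual connection back to the content tokens is one common way to preserve content structure during the swap; the paper's exact formulation may differ.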
Keywords
Transformer-based multi-style information transfer in image processing

Sun Meiting, Dai Longquan, Tang Jinhui (School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China)

Abstract
Objective Arbitrary style transfer renders a content image with the visual style of a reference image: the stylized result preserves the original content structure while exhibiting features similar to those of the style image. Style transfer is an essential branch of image processing. Conventional methods render style using only low-level image information. More recently, deep learning based convolutional neural networks (CNNs) have been adopted in the style transfer domain: a CNN can extract content features and style features and then balance and re-integrate them. However, owing to the constrained receptive field of convolutional layers, a CNN captures only local associations. The Transformer network proposed in natural language processing (NLP) can capture long-distance dependencies and thus model global information, but its expressive power comes at a higher computational cost because correlations must be learned between all input elements; furthermore, the Transformer converges more slowly because it lacks image priors. Given the similarity between the image style transfer process and the sentence translation process, we develop a hybrid network that combines CNNs with the Transformer.

Method The proposed network consists of four parts: an encoding network, a style transformation network, a decoding network, and a discriminative network. Encoding network: convolutional layers extract high-level image features while lowering the image size. The extracted features are sensitive to semantic and consistent information, such as the specific objects and content in the image; however, low-level information such as lines and textures best reflects stylistic features. Residual connections are therefore added to the encoding network to enrich the feature representation. Style transformation network: built on the Transformer, it consists of three subparts: a content encoder, a style encoder, and a decoder. The content encoder and the style encoder add global information to the content features and the style features, respectively. The decoder then updates the original content features with weighted sums of the style features to generate the stylized features. Decoding network: symmetric to the encoding network, it uses interpolation-based upsampling to restore the stylized features to the original image size and produce the final stylized image. Discriminative network: it distinguishes generated style images from natural ones.

Result To evaluate the method, qualitative and quantitative comparisons are carried out against several state-of-the-art arbitrary style transfer methods. The qualitative analysis consists of two parts, a performance comparison and a user study: the performance comparison shows that the proposed network generates smoother and clearer stylized images, and the user study shows that its results are preferred by users. In the quantitative comparison, the stylization-speed results show that the proposed network ranks third, which is acceptable; moreover, its speed remains stable as the image size grows from 256 to 512 pixels. An ablation experiment verifies the effectiveness of the discriminative network: its results show that introducing the discriminative network gives the model better feature extraction ability and yields more realistic images. To demonstrate the flexibility of the proposed network, the trained model is also applied to other style transfer tasks, including content-style trade-off, style interpolation, and region painting.

Conclusion A hybrid network is presented that draws on the complementary strengths of CNNs and the Transformer. Experimental results show that the proposed network further improves transfer speed, and its stylized images preserve the content structure while exhibiting the stylistic features, especially for smaller image sizes (e.g., 256 pixels).
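To show how the four components could fit together, here is a hedged end-to-end sketch in PyTorch under the same assumptions as the earlier attention snippet: strided convolutions in the encoder, interpolation-based (nearest-neighbor) upsampling in the decoder, and a cross-attention module as the style transformation network. Layer counts and channel widths (64, 128, 256) are guesses for illustration, not the paper's configuration; the discriminative network and training losses are omitted.

    import torch
    import torch.nn as nn

    class ConvEncoder(nn.Module):
        """Strided convolutions extract high-level features while shrinking
        the spatial size, so the Transformer sees fewer, richer tokens."""

        def __init__(self, dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, dim, 3, stride=2, padding=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)  # (B, dim, H/8, W/8)

    class ConvDecoder(nn.Module):
        """Interpolation-based upsampling, symmetric to the encoder, maps the
        stylized features back to the image domain."""

        def __init__(self, dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(dim, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(64, 3, 3, padding=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    def stylize(content_img, style_img, encoder, transformer, decoder):
        """End-to-end flow: encode, swap content features for weighted sums
        of style features via cross-attention, then decode to an image."""
        fc = encoder(content_img)                 # (B, C, h, w)
        fs = encoder(style_img)
        b, c, h, w = fc.shape
        tokens_c = fc.flatten(2).transpose(1, 2)  # (B, h*w, C)
        tokens_s = fs.flatten(2).transpose(1, 2)
        stylized = transformer(tokens_c, tokens_s)
        stylized = stylized.transpose(1, 2).reshape(b, c, h, w)
        return decoder(stylized)

During training, a discriminator that separates generated from natural style images would add an adversarial loss on top of the usual content and style losses; according to the ablation above, this term is what improves the smoothness and cleanness of the outputs.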
Keywords
