A survey of deep learning video super-resolution techniques

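Jiang Junjun1, Cheng Hao1, Li Zhenyu1, Liu Xianming1, Wang Zhongyuan2 (1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001; 2. School of Computer Science, Wuhan University, Wuhan 430072)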

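Abstract
Video super-resolution plays a key role in satellite remote sensing and reconnaissance, video surveillance, and medical imaging. It has broad application prospects across these fields and has attracted wide attention, but traditional video super-resolution algorithms have certain limitations. As deep learning has matured, super-resolution algorithms based on deep neural networks have made great progress in performance. Because fully fusing the spatio-temporal information of a video makes it possible to recover realistic and natural textures quickly and efficiently, video super-resolution has become a research hotspot. This paper systematically reviews research progress in deep learning based video super-resolution and comprehensively summarizes its datasets and evaluation metrics. Existing video super-resolution methods are divided by research approach into two broad classes: methods based on image alignment and methods without image alignment. Based on the model structure of the deep convolutional neural network, the history of model optimization, and the motion estimation and compensation method, video super-resolution networks are further subdivided into 10 subclasses, and ample experimental data are used to compare the core idea of each method and the strengths and weaknesses of its network structure. Although the reconstruction quality of video super-resolution networks keeps improving, the number of model parameters is gradually decreasing, and training and inference are becoming faster, existing network models still have room for improvement in performance. This paper also discusses the challenges facing deep learning based video super-resolution and its future prospects.
Keywords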
Deep learning based video-related super-resolution technique: a survey

Jiang Junjun1, Cheng Hao1, Li Zhenyu1, Liu Xianming1, Wang Zhongyuan2(1.School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China;2.School of Computer, Wuhan University, Wuhan 430072, China)

Abstract
Video super-resolution (VSR) aims to restore a high-resolution video from its low-resolution version. It is widely used in domains such as satellite remote sensing, video surveillance, medical imaging, and electronics. To reconstruct high-resolution frames, conventional video super-resolution methods estimate the underlying motion and blur kernel parameters, which is difficult to do reliably across heterogeneous scenes. Because fully integrating the spatio-temporal information in a video makes it possible to recover realistic and natural textures quickly, deep learning based video super-resolution algorithms have developed rapidly. We systematically review and analyze the current state of deep learning based video super-resolution. First, popular YCbCr datasets are introduced, such as YUV25, YUV21, and ultra video group (UVG), together with RGB datasets such as video 4 (Vid4), realistic and dynamic scenes (REDS), and Vimeo90K. The basic information of each dataset is summarized, including its name, year of publication, number of videos, number of frames, and resolution. The evaluation metrics used for video super-resolution are then introduced in detail: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), video quality model for variable frame delay (VQM_VFD), and learned perceptual image patch similarity (LPIPS). We also distinguish video super-resolution from single image super-resolution: the former can exploit the motion information shared between video frames. If a video is processed frame by frame with a single image super-resolution method, the reconstructed video contains a large number of artifacts.
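As an illustration of the first metric named above, the following is a minimal sketch of per-frame PSNR, assuming 8-bit intensities (peak value 255) and frames stored as nested lists. The function name `psnr` and the data layout are choices made for this sketch, not something the survey prescribes.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are nested lists (rows of pixel intensities); this layout is an
    assumption made for the sketch.
    """
    if len(reference) != len(restored):
        raise ValueError("frames must have the same height")
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        if len(ref_row) != len(res_row):
            raise ValueError("frames must have the same width")
        for r, s in zip(ref_row, res_row):
            squared_error += (r - s) ** 2
            count += 1
    mse = squared_error / count  # mean squared error over all pixels
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A video-level score could then be obtained by averaging `psnr` over all frames of a sequence, which is one common convention rather than the only one.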
Deep learning based video super-resolution faces two key technical challenges: image alignment and feature integration. The choice of image alignment module differs sharply between video super-resolution methods, so we categorize methods as image alignment methods or non-alignment methods. The integration of multi-frame information depends on the network structure, such as generative adversarial networks (GAN), recurrent neural networks (RNN), and Transformer. Image alignment methods use different motion estimation and motion compensation modules to process video features and align neighboring frames with the target frame. They fall into three categories by alignment technique: optical flow, kernel, and deformable convolution. Optical flow alignment computes the motion flow between two frames from the temporal changes in the gray levels of their pixels, and a motion compensation module then warps the neighboring frames. According to the model structure of the deep convolutional neural network (CNN), we further divide the optical flow alignment methods into four categories: 2D convolution, RNN, GAN, and Transformer. For 2D convolution methods with optical flow alignment, we mainly introduce the video efficient sub-pixel convolutional network (VESPCN) and methods that improve on its optical flow estimation and motion compensation networks, such as ToFlow and the spatial-temporal transformer network (STTN). For RNN methods with optical flow alignment, we analyze the residual recurrent convolutional network (RRCN), the recurrent back-projection network (RBPN), and related methods that use optical flow to align neighboring frames at the image level; these methods address the constraints of sliding window methods.
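The motion compensation step described above can be sketched as backward warping with bilinear interpolation. This is a minimal illustration under stated assumptions, not the implementation of any surveyed network: the flow is taken as already estimated, each flow entry `(dx, dy)` is assumed to point from a target pixel to its matching position in the neighboring frame, and `warp_frame` is a hypothetical name.

```python
def warp_frame(neighbor, flow):
    """Backward-warp `neighbor` toward the target frame.

    neighbor: H x W nested list of intensities.
    flow: H x W nested list of (dx, dy) pairs; for target pixel (x, y) the
          matching neighbor position is assumed to be (x + dx, y + dy).
    """
    height, width = len(neighbor), len(neighbor[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the frame (border replication).
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation between the four surrounding pixels.
            top = (1 - wx) * neighbor[y0][x0] + wx * neighbor[y0][x1]
            bottom = (1 - wx) * neighbor[y1][x0] + wx * neighbor[y1][x1]
            warped[y][x] = (1 - wy) * top + wy * bottom
    return warped
```

Bilinear sampling is used here because it keeps the warp differentiable with respect to the flow when the same operation is written in a deep learning framework; the pure-Python loop is only meant to make the arithmetic explicit.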
To obtain better reconstruction performance, we then focus on BasicVSR (basic video super-resolution), IconVSR (information-refill mechanism and coupled propagation video super-resolution), and other networks that warp neighboring frames at the feature level. The optical flow alignment based TecoGAN (temporal coherence via self-supervision for GAN-based video generation) and VSR Transformer methods are introduced in detail as well. Because there are only a few kernel-based and deformable convolution based alignment methods, they are difficult to classify further by network structure. Because the convolution kernel size limits the range of motion estimation, the reconstruction performance of kernel-based alignment methods is relatively poor. Deformable convolution improves the sampling of conventional convolution, but it still suffers from high computational complexity and strict convergence conditions. Non-alignment methods rely on various network structures to exploit the correlation between video frames to some extent. We review and analyze non-alignment methods based on 3D convolution, RNN, GAN, and non-local modules. The non-alignment RNN methods include recurrent latent space propagation (RLSP), the recurrent residual network (RRN), and omniscient video super-resolution (OVSR), which show that a balance can be achieved between reconstruction speed and visual quality. For non-alignment non-local methods, we focus on improved non-local modules that reduce the computational cost. All models are tested at 4× downsampling under two degradations: bicubic interpolation (BI) and blur downsampling (BD). Quantitative results and speed comparisons of the super-resolution methods on multiple datasets, including REDS4, UDM10, and Vid4, are summarized as well. Some results still leave room for improvement.
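The blur downsampling (BD) degradation mentioned above can be sketched as a Gaussian blur followed by subsampling. The sketch below is only illustrative: the standard deviation `sigma=1.6`, the kernel radius of three standard deviations, the border replication, and all function names are assumptions made here, not values stated in this abstract.

```python
import math


def gaussian_kernel(sigma=1.6):
    """Normalized 1D Gaussian kernel; sigma=1.6 is an assumed default."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(rows, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    out = []
    for row in rows:
        n = len(row)
        out.append([
            sum(k * row[min(max(i + j - radius, 0), n - 1)]
                for j, k in enumerate(kernel))
            for i in range(n)
        ])
    return out


def transpose(rows):
    return [list(col) for col in zip(*rows)]


def blur_downsample(frame, scale=4, sigma=1.6):
    """BD-style degradation: separable Gaussian blur, then keep every
    `scale`-th pixel in both directions."""
    kernel = gaussian_kernel(sigma)
    blurred = blur_rows(frame, kernel)                             # horizontal pass
    blurred = transpose(blur_rows(transpose(blurred), kernel))     # vertical pass
    return [row[::scale] for row in blurred[::scale]]
```

For example, an 8 × 8 frame becomes a 2 × 2 frame at `scale=4`. The BI degradation would replace the blur-and-subsample step with bicubic resampling, which is omitted here to keep the sketch short.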
The reconstruction performance of video super-resolution networks keeps improving, the number of model parameters is gradually shrinking, and training and inference are becoming faster. However, the application of deep learning to video super-resolution still needs further development. We expect that the adaptability of networks will need to be improved and their results validated. Future work on deep learning based video super-resolution can proceed in the following nine directions: network training and optimization, video super-resolution for ultrahigh resolutions, super-resolution of compressed video, video rescaling methods, self-supervised video super-resolution, video super-resolution at various scales, spatio-temporal video super-resolution, auxiliary task guided video super-resolution, and scenario-customized video super-resolution.
Keywords
