  • Published: 2024-08-28
Cross-scale Transformer image super-resolution reconstruction with fused channel attention

(1. Nanjing University of Information Science & Technology; 2. Northwest University)

Abstract
Objective With the development of deep learning, Transformer-based network architectures have been introduced into computer vision and have achieved remarkable results. In super-resolution tasks, however, Transformer models suffer from a single feature-extraction pattern, loss of high-frequency detail, and structural distortion in the reconstructed image. To address these problems, we propose a cross-scale Transformer image super-resolution reconstruction model with fused channel attention. Method The model consists of four modules: shallow feature extraction, cross-scale deep feature extraction, multilevel feature fusion, and high-quality reconstruction. Shallow feature extraction applies convolution to the early image to obtain a more stable output. Cross-scale deep feature extraction uses a cross-scale Transformer and an enhanced channel attention mechanism to enlarge the receptive field and extract features at different scales through weighted filtering for later fusion. The multilevel feature fusion module uses the enhanced channel attention mechanism to dynamically adjust the channel weights of features at different scales, encouraging the model to learn rich contextual information and strengthening its capability in image super-resolution reconstruction. Result Evaluation on the Set5, Set14, BSD100, Urban100, and Manga109 benchmark datasets shows that, compared with other mainstream super-resolution models, the proposed model improves peak signal-to-noise ratio by 0.06 dB to 0.25 dB and produces visually better reconstructions. Conclusion By fusing convolutional features with Transformer features and using the enhanced channel attention mechanism to suppress noise and redundant information, the proposed model reduces the likelihood of blurring and distortion in reconstructed images and effectively improves super-resolution performance; test results on several public datasets verify its effectiveness.
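The enhanced channel attention that the abstract refers to is, at its core, a learned per-channel reweighting of feature maps. Below is a minimal PyTorch sketch of such a block in the common squeeze-and-excitation style; the class name, reduction ratio, and layer layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global average
    pooling summarizes each channel, a small bottleneck MLP predicts a
    per-channel weight, and the input features are rescaled by it."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # (B, C, H, W) -> (B, C, 1, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                              # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(self.pool(x))              # reweight each channel

# Example: reweight a batch of 64-channel feature maps.
feats = torch.randn(2, 64, 48, 48)
print(ChannelAttention(64)(feats).shape)               # torch.Size([2, 64, 48, 48])
```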
Cross-scale Transformer image super-resolution reconstruction with fused channel attention

Li Yan1, Dong Shihao1, Zhang Jiawei1, Zhao Ru2, Zheng Yuhui1 (1. Nanjing University of Information Science & Technology; 2. Northwest University)

Abstract
Objective Image super-resolution reconstruction converts low-resolution (LR) images into high-resolution (HR) images of the same scene. In recent years the technique has been widely used in computer vision, image processing, and other fields because of its broad practical value and deep theoretical significance. Although convolutional neural network-based models have made great progress in reconstruction performance, most super-resolution networks take a single-level, end-to-end form that ignores multilevel feature information during reconstruction and thereby limits performance. With the development of deep learning, Transformer-based architectures have been introduced into computer vision with significant results, and researchers have brought them into low-level vision tasks. In image super-resolution reconstruction, however, Transformer models suffer from a single feature-extraction pattern, loss of high-frequency detail in the reconstructed image, and structural distortion. To solve these problems, we propose a cross-scale Transformer image super-resolution reconstruction model with fused channel attention.
Method The model consists of four modules: shallow feature extraction, cross-scale deep feature extraction, multilevel feature fusion, and a high-quality reconstruction module. Shallow feature extraction applies convolution to the early image, since convolutional layers provide stable optimization and extraction in early visual processing and yield a more stable output. The cross-scale deep feature extraction module uses a cross-scale Transformer together with an enhanced channel attention mechanism to acquire features at different scales. The core of the cross-scale Transformer is a cross-scale self-attention mechanism and a gated convolutional feed-forward network: the self-attention downsamples feature maps to different scales by scale factors and exploits image self-similarity to learn contextual information, while the gated convolutional network, which replaces the feed-forward network of the conventional Transformer, encodes the positions of spatially neighboring pixels and helps the model learn local image structure. After the cross-scale Transformer, an enhanced channel attention mechanism enlarges the receptive field and extracts features at different scales, replacing the original features through weighted filtering before they are propagated onward. Because increasing network depth leads to saturation, we set the number of residual cross-scale Transformer blocks to three, balancing model complexity against reconstruction performance. After features of different scales are stacked in the multilevel feature fusion module, the enhanced channel attention mechanism dynamically adjusts their channel weights so that the network learns rich contextual information, strengthening its reconstruction capability. In the high-quality reconstruction module, convolutional layers and pixel shuffle upsample the features to the dimensions of the high-resolution image.
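To make the gated convolutional feed-forward idea concrete, here is a minimal PyTorch sketch of a gated feed-forward block in which a depthwise convolution encodes neighboring-pixel information and one branch gates the other, as described above; the expansion factor, names, and exact wiring are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class GatedConvFFN(nn.Module):
    """Gated convolutional feed-forward network: a 1x1 conv expands the
    channels into two branches, a depthwise 3x3 conv injects local
    spatial (neighboring-pixel) information, and one branch gates the
    other before a 1x1 conv projects back to the input width."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)  # depthwise
        self.project = nn.Conv2d(hidden, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u, v = self.dwconv(self.expand(x)).chunk(2, dim=1)  # two branches
        return self.project(self.act(u) * v)                # gate, then project

x = torch.randn(1, 64, 48, 48)
print(GatedConvFFN(64)(x).shape)  # torch.Size([1, 64, 48, 48])
```

Because the gate is computed from the same spatially convolved features, the block can suppress activations that conflict with the local image structure, which is the role the abstract assigns to the gated convolutional network.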
In the training phase we trained the model on 900 HR images from the DIV2K dataset; the corresponding LR images were generated from the HR images by bicubic downsampling (at scale factors of ×2, ×3, and ×4), and the network was optimized with the Adam algorithm. Result We tested on five standard datasets, Set5, Set14, BSD100, Urban100, and Manga109, and compared the proposed model with nine state-of-the-art models: enhanced deep residual networks for single image super-resolution (EDSR), residual channel attention networks (RCAN), second-order attention network (SAN), cross-scale non-local attention (CSNLA), cross-scale internal graph neural network (IGNN), holistic attention network (HAN), non-local sparse attention (NLSA), image restoration using Swin Transformer (SwinIR), and efficient long-range attention network (ELAN). We use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as performance metrics. Because humans are highly sensitive to image brightness, we compute both metrics on the Y channel of the image. Experimental results show that the proposed model obtains higher PSNR and SSIM values and recovers more detail and more accurate textures at magnification factors of ×2, ×3, and ×4. The proposed method improves on SwinIR by 0.13 dB to 0.25 dB and on ELAN by 0.07 dB to 0.21 dB on the Urban100 dataset, and improves on SwinIR by 0.07 dB to 0.21 dB and on ELAN by 0.06 dB to 0.19 dB on the Manga109 dataset. We further analyze the model with local attribution maps (LAM): the results show that the proposed model draws on a wider range of pixel information and exhibits a higher diffusion index (DI) than SwinIR, which supports its effectiveness from an interpretability standpoint. Conclusion By fusing convolutional features with Transformer features and using the enhanced channel attention mechanism to suppress noise and redundant information, the proposed cross-scale Transformer image super-resolution reconstruction model with multilevel fused channel attention reduces the likelihood of blurring and distortion in reconstructed images and effectively improves super-resolution performance; test results on several public datasets verify the effectiveness of the proposed model. Visually, the model produces reconstructed images that are sharper, closer to the real image, and contain fewer artifacts.
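As a reference for the evaluation protocol described above, here is a minimal sketch of computing PSNR on the Y channel using the BT.601 luma conversion that super-resolution benchmarks commonly use; the helper names are illustrative, and details such as border cropping are omitted.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 RGB image in [0, 255] to the BT.601 Y channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR between a super-resolved and a ground-truth image,
    measured on the Y channel only."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

# Example with synthetic data standing in for SR output and HR ground truth.
hr = np.random.rand(48, 48, 3) * 255
sr = np.clip(hr + np.random.randn(48, 48, 3), 0, 255)
print(f"PSNR(Y): {psnr_y(sr, hr):.2f} dB")
```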