并行生成网络的红外—可见光图像转换
摘 要
目的 针对现有图像转换方法的深度学习模型中生成式网络(generator network)结构单一化问题,改进了条件生成式对抗网络(conditional generative adversarial network,CGAN)的结构,提出了一种融合残差网络(ResNet)和稠密网络(DenseNet)两种不同结构的并行生成器网络模型。方法 构建残差、稠密生成器分支网络模型,输入红外图像,分别经过残差、稠密生成器分支网络各自生成可见光转换图像,并提出一种基于图像分割的线性插值算法,将各生成器分支网络的转换图像进行融合,获取最终的可见光转换图像;为防止小样本条件下的训练过程中出现过拟合,在判别器网络结构中插入dropout层;设计最优阈值分割目标函数,在并行生成器网络训练过程中获取最优融合参数。结果 在公共红外-可见光数据集上测试,相较于现有图像转换深度学习模型Pix2Pix和CycleGAN等,本文方法在性能指标均方误差(mean square error,MSE)和结构相似性(structural similarity index,SSIM)上均取得显著提高。结论 并行生成器网络模型有效融合了各分支网络结构的优点,图像转换结果更加准确真实。
关键词
Infrared-to-visible image translation based on parallel generator network
Yu Peilun1, Shi Quan1,2, Wang Han1,2(1.School of Information Science and Technology, Nantong University, Nantong 226019, China;2.School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China) Abstract
Objective Image-to-image translation involves the automated conversion of input data into a corresponding output image, which differs in characteristics such as color and style. Examples include converting a photograph to a sketch or a visible image to a semantic label map. Translation has various applications in the field of computer vision such facial recognition, person identification, and image dehazing. In 2014, Goodfellow proposed an image generation model based on generative adversarial networks (GANs). This algorithm uses a loss function to classify output images as authentic or fabricated while simultaneously training a generative model to minimize loss. GANs have achieved impressive image generation results using adversarial loss specifically. For example, the image-to-image translation framework Pix2Pix was developed using a GAN architecture. Pix2Pix operates by learning a conditional generative model from input-output image pairs, which is more suitable for translation tasks. In addition, U-Net has often been used as generator networks in place of conventional decoders. While Pix2Pix provides a robust framework for image translation, acquiring sufficient quantities of paired input-output training data can be challenging. In order to solve this problem, cycle-consistent adversarial networks (CycleGANs) were developed by adding an inverse mapping and cycle consistency loss to enforce the relationship between generated and input images. In addition, ResNets have been used as generators to enhance translated image quality. Pix2PixHD offers high-resolution (2 048×1 024 pixels) output using a modified multiscale generator network that includes an instance map in the training step. Although these algorithms have effectively been used for image-to-image translation and a variety of related applications, they typically adopt U-Net or ResNet generators. These single-structure networks struggle to keep high performance across multiple evaluation indicators. As such, this study presents a novel parallel stream-based generator network to increase the robustness across multiple evaluation indicators. Unlike in previous studies, this model consists of two entirely different convolutional neural network (CNN) structures. The output translated visible image of each stream is fused with a linear interpolation-based fusion method to allow for simultaneous optimization of parameters in each model. Method The proposed parallel generator network consists of one ResNet processing stream and one DenseNet processing stream, which are fused in parallel. The ResNet stream includes down-sampling and nine Res-Unit feature extraction networks. Each Res-Unit consists of a feedforward neural network exhibiting elementwise addition. Two convolution layers are skipped. Similarly, the DenseNet stream includes down-sampling and nine Den-Unit feature extraction networks. Every Den-Unit is composed of three convolutional layers and two concatenation layers. As a result, the Den-Units output a concatenation of deep feature maps produced in all three convolutional layers. To utilize the advantages of both ResNet and DenseNet streams, two generated images are segmented into low-and high-intensity image parts with an optimal intensity threshold. Then, a linear interpolation method is proposed to fuse the segmented output images of two generator streams in the R, G, B channel respectively. We also design an intensity threshold objective function to obtain optimal parameters in the generator raining process. In addition, to avoid overfitting during training under a small dataset, we modify the discriminator structure by including four convolution-dropout pairs and a convolution layer. Result We compared our model with six state-of-the-art saliency models, including CRN(cascaded refinement networks), SIMS(semi-parametric image synthesis), Pix2Pix(pixel to pixel), CycleGAN(cycle generative adversarial networks), MUNIT(multimodal unsupervised image-to-image translation) and GauGAN(group adaptive normalization generative adversarial networks), on a public dataset named "AAU(Aalborg University) RainSnow Traffic Surveillance Dataset". The experimental dataset, which was composed of 22 5-min video sequences acquired from traffic intersections in the Danish cities of Aalborg and Viborg, was used for testing purposes. This dataset was collected at seven different locations with a conventional RGB camera and a thermal camera, each with a resolution of 640×480 pixels, at 20 frames per second. The total experimental dataset consisted of 2 100 RGB-IR image pairs, and each scene was then randomly divided into training and test datasets by 80%-20%. In this study, multi-perspective evaluation results were acquired using the mean square error (MSE), structural similarity index (SSIM), gray intensity histogram correlation, and Bhattacharyya distance. The advantages of a parallel stream-based generator network were assessed by comparing the proposed parallel generator with a ResNet, DenseNet, and residual dense block (RDN)-based hybrid network. We evaluated the average MSE and SSIM values for the test data, produced using four different generators (ParaNet, ResNet, DenseNet, and RDN). The proposed method achieved an average MSE of 34.835 8, which was lower than that of ResNet, DenseNet, and hybrid RDN network. Simultaneously, the average SSIM value produced with the proposed method was 0.747 7, which was also higher than that of DenseNet, ResNet, and RDN. This result shows that the proposed parallel structure-based network produced more effective fusion results than RDB-based hybrid network structure. Moreover, comparative experiments demonstrated that parallel generator structure improves the robustness performance across multi-perspective evaluations for infrared-to-visible image translation. Compared with the six conventional methods, the MSE performance (lower is better) increased by at least 22.30%, and the SSIM (higher is better) decreased by at least 8.55%. The experimental results show that the proposed parallel generator network-based infrared-to-visible image translation deep learning model achieves high performance in terms of MSE or SSIM compared with conventional deep learning models such as CRN, SIMS, Pix2Pix, CycleGAN, MUNIT, and GauGAN. Conclusion A novel parallel stream architecture-based generator network was proposed for infrared-to-visible image translation. Unlike conventional models, the proposed parallel generator structure consists of two different network architectures:a ResNet and a DenseNet. Parallel linear combination-based fusion allowed the model to incorporate benefits from both networks simultaneously. The structure of discriminator networks used in the conditional GAN framework was also improved for training and identifying optimal ParaNet parameters. The experimental results showed that the inclusion of different networks led to increases in common assessment metrics. The MSE, SSIM, and intensity histogram similarity for the proposed parallel generator network were higher than those of existing models. In the future, this algorithm will be applied to image dehazing.
Keywords
|