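Parallel cross deep convolutional neural network model
Tang Pengjie1,2,3, Wang Hanli1,2, Zuo Lingxuan1,2 (1. Department of Computer Science and Technology, Tongji University, Shanghai 201804; 2. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092; 3. College of Mathematical and Physical Science, Jinggangshan University, Ji'an 343009)
Abstract
Objective Image classification and recognition is a classic problem in computer vision and underlies technologies such as image retrieval, object recognition, and video analysis and understanding. Models based on deep convolutional neural networks (CNNs) have achieved major breakthroughs in this field, far outperforming traditional models based on hand-crafted features. However, many deep models have a very large number of neurons and parameters and are difficult to train. Drawing on deep CNN models and the principles of human vision, a deep parallel cross CNN model (PCCNN) is therefore proposed and designed. Method Built on Alex-Net, the model extracts two groups of deep CNN features through two deep CNN data transform flows; at the top of the model, after two rounds of mixing and crossing, a 1024-dimensional image feature vector is obtained, and Softmax regression is then used to classify and recognize the image. Result Compared with similar models, the features extracted by this model are more discriminative and yield better classification and recognition performance. On Caltech101, top-1 accuracy reaches about 63%, nearly 5% higher than VGG16 and nearly 10% higher than GoogLeNet; on Caltech256, top-1 accuracy exceeds 46%, nearly 5% higher than VGG16 and 2.6% higher than GoogLeNet. Conclusion The PCCNN model is effective for image classification and recognition and outperforms comparable models on medium-scale datasets; its performance on large-scale datasets remains to be verified. The model also offers a new idea for designing other deep CNN models: extracting more feature information while controlling depth, thereby improving the performance of deep models.
Keywords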
Parallel cross deep convolution neural networks model
Tang Pengjie1,2,3, Wang Hanli1,2, Zuo Lingxuan1,2 (1. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; 2. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China; 3. College of Mathematical and Physical Science, Jinggangshan University, Ji'an 343009, China)
Abstract
Objective The classification and recognition of images play an important role in many applications, such as image retrieval, object detection, and video content analysis. Deep convolutional neural network (CNN) models have recently achieved a major breakthrough, surpassing earlier state-of-the-art methods for image classification and recognition, because the features extracted by CNN models are more discriminative and carry more semantic information than those of traditional approaches. However, CNN models such as Alex-Net and ZFCNN-Net are very simple and cannot extract richer information for representing images, whereas models such as VGG16/VGG19 and GoogLeNet have a huge number of neurons and parameters. Method In this work, a novel model named deep parallel cross CNN (PCCNN) is proposed, which extracts more effective information from images while using fewer neurons and parameters than other models. Inspired by the mechanism of human vision, which has two visual pathways and an optic chiasm, the proposed PCCNN is built on Alex-Net and extracts two groups of CNN features in parallel through a pair of deep CNN data transform flows. After the first fully connected layer in each stream, the information from the two streams is fused. The fused information is forwarded to the next two fully connected layers, and their outputs are fused again to obtain more powerful feature representations. Finally, for image classification, the Softmax regression function is applied to a 1024-D image feature vector obtained from the fusion of the two feature groups. Alex-Net is used as the base model because of its simple architecture and its small number of neurons. In the PCCNN model, the first stream is the original Alex-Net; in the second stream, the stride of the first convolutional layer is 6 instead of 4.
With a larger stride in the convolutional layer, a single stream performs worse because more information is lost. However, when the two streams are combined, the proposed model performs better than all the other models. In addition, because a larger stride is used in the second stream, its feature maps are smaller, so the number of neurons and parameters does not increase greatly. Result Three popular public datasets, Caltech101, Caltech256, and Scene15, are selected to evaluate the model. Several state-of-the-art models are implemented with the same settings for comparison. Experimental results show that the proposed PCCNN model achieves better image classification performance than these models, indicating that the features extracted by PCCNN are more discriminative and have stronger representation ability. On the Caltech101 dataset, PCCNN reaches a top-1 accuracy of approximately 63%, exceeding the VGG16 model by about 5% and the GoogLeNet model by about 10%. On the Caltech256 dataset, it also outperforms the other models with a top-1 accuracy of 46.4%, surpassing VGG16 and GoogLeNet by 5% and 2.6%, respectively. On the Scene15 dataset, however, the model performs worse than GoogLeNet, although its accuracy is still higher than that of a single Alex-Net. Conclusion The proposed PCCNN model performs better than several state-of-the-art CNN models in image classification and recognition, particularly on medium-scale datasets, but it shows no advantage on the small-scale dataset. The model should therefore be tested further on large-scale vision tasks, such as the ImageNet or SUN datasets, which the authors plan as future work.
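The effect of the larger stride on feature-map size can be checked with simple arithmetic. The sketch below is illustrative only: the 227×227 input, 11×11 kernel, zero padding, and 96 filters are assumed values commonly used for the first Alex-Net convolutional layer and are not stated in this abstract; only the strides 4 and 6 come from the text.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# Assumed first-layer settings (not given in the abstract).
INPUT_SIZE = 227   # assumed input width/height in pixels
KERNEL_SIZE = 11   # assumed kernel width/height
NUM_FILTERS = 96   # assumed number of output channels


def conv1_summary(stride: int) -> tuple[int, int]:
    """Return (feature-map side length, total activations) for the first layer."""
    side = conv_output_size(INPUT_SIZE, KERNEL_SIZE, stride)
    return side, side * side * NUM_FILTERS


# Stream 1 uses stride 4 and stream 2 uses stride 6, as stated in the text.
stream1_side, stream1_units = conv1_summary(4)
stream2_side, stream2_units = conv1_summary(6)

print(f"stride 4: {stream1_side}x{stream1_side} maps, {stream1_units} activations")
print(f"stride 6: {stream2_side}x{stream2_side} maps, {stream2_units} activations")
print(f"second stream / first stream: {stream2_units / stream1_units:.2f}")
```

Under these assumptions, the second stream's first layer produces 37×37 feature maps instead of 55×55, which is fewer than half as many activations per filter. This is consistent with the statement that adding the second stream does not greatly increase the number of neurons.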
The PCCNN model is not only applicable to image classification and recognition; it also offers a new way of thinking about deep CNN model design. In a deep CNN model, the deeper the architecture, the more neurons and parameters it has, and the complexity increases significantly. The width of the model can instead be increased to extract more features and obtain better performance. Although widening also increases the number of neurons and parameters, the increase is slower than when more layers are added to a single model; furthermore, the model is more consistent with the physiological mechanism of human vision. Finally, the PCCNN model is highly extensible.
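The two fusion steps at the top of the model can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the streams are fused, so concatenation is assumed; "the next two fully connected layers" is read here as one layer per stream, both fed with the fused vector; and the layer sizes (`CONV_A`, `CONV_B`, `HIDDEN`), the random weights, and the function name `cross_fusion_head` are hypothetical. Only the 1024-D final feature vector and the Softmax classifier come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
CONV_A, CONV_B = 256, 128   # stand-ins for the flattened conv features of each stream
HIDDEN = 512                # assumed width, so that two fused outputs give 1024-D
NUM_CLASSES = 101           # e.g. one output per Caltech101 category


def init(n_in: int, n_out: int):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


def dense(x, w, b):
    """Fully connected layer followed by ReLU."""
    return np.maximum(x @ w + b, 0.0)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


params = {
    "fc1_a": init(CONV_A, HIDDEN),
    "fc1_b": init(CONV_B, HIDDEN),
    "fc2_a": init(2 * HIDDEN, HIDDEN),
    "fc2_b": init(2 * HIDDEN, HIDDEN),
    "cls": init(2 * HIDDEN, NUM_CLASSES),
}


def cross_fusion_head(feat_a, feat_b, params):
    # First fully connected layer of each stream.
    h_a = dense(feat_a, *params["fc1_a"])
    h_b = dense(feat_b, *params["fc1_b"])
    # First fusion (assumed: concatenation); both streams see the fused vector.
    fused1 = np.concatenate([h_a, h_b], axis=-1)
    g_a = dense(fused1, *params["fc2_a"])
    g_b = dense(fused1, *params["fc2_b"])
    # Second fusion gives the 1024-D image feature vector.
    fused2 = np.concatenate([g_a, g_b], axis=-1)
    w, b = params["cls"]
    return fused2, softmax(fused2 @ w + b)


batch = 4
feat_a = rng.normal(size=(batch, CONV_A))
feat_b = rng.normal(size=(batch, CONV_B))
features, probs = cross_fusion_head(feat_a, feat_b, params)
print(features.shape, probs.shape)
```

The second stream's input is deliberately smaller than the first stream's, mirroring the smaller feature maps produced by the larger stride; the fully connected layers then bring both streams to the same width before they are fused.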
Keywords