Art image classification with a double kernel squeeze-and-excitation neural network

Yang Xiuqin, Zhang Huaxiong (School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

Abstract
Objective To fully extract the overall style and local detail features of art images such as engravings, Chinese paintings, oil paintings, watercolor paintings, and gouache paintings, and to meet the need for automatic classification and retrieval of art images by computer, a convolutional neural network built from a double kernel squeeze-and-excitation (DKSE) module and depthwise separable convolutions is proposed for art image classification.

Method The DKSE module is constructed by combining the structural characteristic of SKNet (selective kernel networks), which adaptively adjusts the receptive field to extract the overall and detail features of an image, with the characteristic of SENet (squeeze-and-excitation networks), which enhances channel features. The convolution kernels on the branches of the DKSE module extract the overall features and local detail features of the input image; the feature maps of the branches are fused, and the fused feature map is squeezed and excited; the processed features are weighted and mapped back onto the feature maps of the different branches, which are then fused. A convolutional neural network is built from DKSE modules and depthwise separable convolutions to classify art images (a code sketch of the module is given below).

Result The proposed network model is used to classify the data with and without data augmentation (25 634 images in total across the five kinds of art images after augmentation); the classification accuracy with data augmentation is 9.21% higher than without it. Compared with other network models and traditional classification methods, the classification accuracy of the proposed method reaches 86.55%, which is 26.35% higher than that of traditional classification methods. When the convolution kernels on the DKSE branches are 1×1 and 5×5 and the module is placed after the third depthwise separable convolution of the proposed network, the classification accuracy reaches 87.58%.

Conclusion The DKSE module can effectively improve the classification performance of the model, fully extracts the overall and local detail features of art images, and achieves better classification accuracy than traditional network models.
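The abstract describes the DKSE module only at a high level, so the following is a minimal sketch in tf.keras of one plausible implementation. The element-wise addition used for both fusion steps, the ReLU activations, and the default branch kernel sizes (1×1 and 5×5) and reduction ratio (4) are assumptions drawn from the abstract and from the standard SKNet/SENet designs, not the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers


def dkse_block(x, kernel_sizes=(1, 5), reduction=4):
    """Sketch of the double kernel squeeze-and-excitation (DKSE) block."""
    channels = x.shape[-1]

    # Two branches with different kernel sizes capture the overall style
    # features and the local detail features, respectively.
    branches = [
        layers.Conv2D(channels, k, padding="same", activation="relu")(x)
        for k in kernel_sizes
    ]

    # Fuse the branch feature maps (element-wise sum assumed).
    fused = layers.Add()(branches)

    # Squeeze: compress the spatial information into a channel descriptor.
    squeeze = layers.GlobalAveragePooling2D()(fused)
    squeeze = layers.Reshape((1, 1, channels))(squeeze)

    # Excite: 1x1 convolutions first reduce and then restore the channel
    # dimension; the sigmoid gate yields weights normalized to (0, 1).
    excite = layers.Conv2D(channels // reduction, 1, activation="relu")(squeeze)
    excite = layers.Conv2D(channels, 1, activation="sigmoid")(excite)

    # Rescale each branch with the channel weights and fuse once more.
    rescaled = [layers.Multiply()([b, excite]) for b in branches]
    return layers.Add()(rescaled)
```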
Keywords
Art image classification with double kernel squeeze-and-excitation neural network

Yang Xiuqin, Zhang Huaxiong (School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

Abstract
Objective The development of online digital media technology has promoted the sharing and spreading of natural art images. However, given the increasing number of art images, effective classification and retrieval are urgent problems that need to be solved. When facing massive art image data, traditional manual feature extraction methods suffer from problems such as tagging errors and subjective tagging, and they place relatively high professional demands on the people doing the classification. Convolutional neural networks (CNNs) are widely used in image classification because of their ability to extract features automatically. Most of these network models are used to extract features from the key regions of photographed images. However, natural art images differ from photographed images: their overall style features and local detail features are distributed relatively uniformly across the whole image. Selective kernel networks (SKNet) can adaptively adjust their receptive field size according to the input image to select multi-scale spatial information. However, the softmax gating mechanism in the module only strengthens the inter-channel dependencies of the feature map produced by the branch whose receptive field responds most strongly to the stimulus, and it neglects the role of local detail features. Squeeze-and-excitation networks (SENet) can enhance the features in different channels but cannot extract the overall features and local detail features of the input. To fully extract and enhance the overall style features and local detail features of art images and realize automatic classification and retrieval of art images, we combine the characteristics of SKNet and SENet to build a block called the double kernel squeeze-and-excitation (DKSE) module. DKSE blocks and depthwise separable convolutions are mainly used to construct a CNN that classifies art images.

Method SKNet can capture the overall features and local detail features at different scales. Following the multi-scale structure of SKNet, we build the DKSE module with two branches. Each branch uses a different convolutional kernel to extract the overall features and the local detail features, and the feature maps obtained by the two convolution operations are fused. Then, following the squeeze-and-excitation idea of SENet, the spatial information of the fused feature map is compressed into a channel descriptor by global average pooling (GAP). After the GAP operation, 1×1 convolutions are used to compress and excite the channel descriptor, and channel weights normalized to (0, 1) are obtained through a sigmoid gating mechanism. These weights rescale the feature maps of the different branches, and the final output of the block is obtained by fusing the rescaled feature maps. In this way, more representative characteristics of art images are extracted. In this study, we choose engraving, Chinese painting, oil painting, opaque watercolor painting, and watercolor painting for classification. To augment the art image data, high-resolution art images are manually collected and randomly cropped into 299×299 pixel patches, and the patches with rich style information are then selected. After data augmentation, a total of 25 634 images across the five kinds of art images are obtained. The CNN is constructed from multiple DKSE modules and depthwise separable convolutions to classify the five kinds of art images.
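As a rough illustration of how such a network might be assembled, the sketch below stacks depthwise separable convolutions and inserts the dkse_block() defined earlier after the third separable convolution (the placement reported to work best). The number of separable convolutions, their filter counts, and the pooling layers are illustrative assumptions, not the authors' exact architecture.

```python
from tensorflow.keras import layers, Model


def build_model(input_shape=(299, 299, 3), num_classes=5):
    """Illustrative CNN built from depthwise separable convolutions and a DKSE block."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)

    # Stack of depthwise separable convolutions (depth and widths assumed).
    for i, filters in enumerate([64, 128, 256, 512, 728], start=1):
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
        if i == 3:
            # DKSE block after the third depthwise separable convolution,
            # with 1x1 and 5x5 branch kernels and a reduction ratio of 4.
            x = dkse_block(x, kernel_sizes=(1, 5), reduction=4)

    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)
```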
In all experiments, 80% of the art images of each kind are randomly selected as the training set, and the remaining 20% are used as the validation set. Our CNN is implemented in the Keras framework. The input images are resized to 299×299 pixels for training. The Adam optimizer is used, the initial learning rate is 0.001, the mini-batch size is 32, and the total number of training epochs is 120. During training, the training images are randomly rotated between 0° and 20°, randomly shifted horizontally or vertically by up to 10%, and randomly flipped to enhance the generalization ability of the proposed CNN. The learning rate is decreased by a factor of 10 if the training accuracy does not improve for three epochs. (A sketch of this training configuration follows the abstract.)

Result Our network model is used to classify the data with and without data augmentation. The classification accuracy after data augmentation is 9.21% higher than that without augmentation. Compared with other network models and traditional art image classification methods, the classification accuracy of our method is 86.55%, which is 26.35% higher than that of the traditional art image classification methods. Compared with the Inception-V4 network, our model has approximately 33% of its parameters, and the time spent is approximately 25% of that of Inception-V4. In this study, we place the proposed DKSE module at three different positions in the network and verify its influence on the classification results. When the module is placed after the third depthwise separable convolution of the network model, the reduction ratio is set to 4, and the convolution kernel sizes on the branches are 1×1 and 5×5, the classification accuracy reaches 87.58%, which is 1.58% higher than that of the other eight state-of-the-art network models. The classification accuracy with a reduction ratio of 4 is also superior to that with a reduction ratio of 16. We use the gradient-weighted class activation mapping (Grad-CAM) algorithm with our network model, the ours + SK model, and the ours + SE model to visualize the overall features and local detail features of each kind of art image. The results show that, compared with the other two network models, our network model can fully extract the overall features and local detail features of art images.

Conclusion Experimental results show that the proposed DKSE module can effectively improve the classification performance of the network model and fully extract the overall features and local detail features of art images. The network model in this study achieves better classification accuracy than the other CNN models.
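The training setup above maps naturally onto standard Keras utilities. The sketch below is one plausible configuration: the data directory path and the use of ImageDataGenerator with a 1/255 rescale are assumptions, and build_model() is the illustrative constructor sketched earlier, while the optimizer, learning rate, batch size, epoch count, augmentation ranges, and learning-rate schedule follow the abstract.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation as described: random rotation up to 20 degrees, horizontal
# and vertical shifts up to 10%, and random flips.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,          # assumed preprocessing
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
)

train_data = train_gen.flow_from_directory(
    "data/train",               # hypothetical path holding 80% of each class
    target_size=(299, 299),
    batch_size=32,
    class_mode="categorical",
)

model = build_model(input_shape=(299, 299, 3), num_classes=5)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Decrease the learning rate by a factor of 10 when the training accuracy
# does not improve for three epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="accuracy", factor=0.1, patience=3
)

model.fit(train_data, epochs=120, callbacks=[reduce_lr])
```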
Keywords
