Complementary attention method for fine-grained image classification

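Zhao Xun, Wang Jiabao, Li Yang, Wang Yapeng, Miao Zhuang (Command and Control Engineering College, Army Engineering University of PLA, Nanjing 210007)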

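Abstract
Objective Fine-grained classification has long been a challenging task because its objects exhibit subtle inter-class differences and large intra-class variation. Most methods use attention mechanisms to learn salient local features of the target. However, traditional attention mechanisms tend to focus only on the most salient local features and suppress the secondary salient information in other regions, even though the suppressed information usually also contains effective features of the target. To fully extract the effective salient features of the target, this paper proposes a simple yet effective complementary attention mechanism. Method Building on the SE (squeeze-and-excitation) attention mechanism, a new attention mechanism is proposed, called the complementary attention mechanism (complemented SE, CSE). It extracts the main salient local features from the original features and also extracts secondary salient features from the suppressed residual channel information. These features are complementary, and fusing them yields a more efficient feature representation. Result The proposed method is validated on four fine-grained datasets: CUB-Birds (Caltech-UCSD Birds-200-2011), Stanford Dogs, Stanford Cars, and FGVC-Aircraft (fine-grained visual classification of aircraft). With ResNet50 as the backbone network, the classification accuracy on the test sets reaches 87.9%, 89.1%, 93.9%, and 92.4%, respectively. The experimental results show that the proposed method surpasses the current best-performing methods on the CUB-Birds and Stanford Dogs datasets, and its performance on the Stanford Cars and FGVC-Aircraft datasets is close to that of current mainstream methods. Conclusion The proposed method focuses on improving the ability of the attention mechanism to extract features and obtains an efficient feature representation of the target. It can be used for fine-grained image classification and for computer vision tasks related to feature extraction.
Keywords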
Complemented attention method for fine-grained image classification

Zhao Xun, Wang Jiabao, Li Yang, Wang Yapeng, Miao Zhuang(Command and Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)

Abstract
Objective Fine-grained image classification aims to distinguish objects from very similar categories, such as subcategories of birds, dogs, and cars, in contrast to coarse-grained image classification. Because of small inter-class variation and large intra-class variation, fine-grained image classification is more challenging than general image classification. The key is to extract the subtle discriminative features of the object. The attention mechanism can actively learn the salient features of the target and has been widely used in image feature extraction tasks. However, traditional attention mechanisms, e.g., the SE (squeeze-and-excitation) attention mechanism, the OSME (one-squeeze multi-excitation) attention mechanism, and BAM (bottleneck attention module), cannot adequately extract the effective features of the object. They focus on the most salient features of the target and suppress the feature representation of other regions, yet the suppressed regions usually also contain effective features of the target. A more adequate feature representation can be obtained by also extracting features from the suppressed regions of the object. On this basis, a new attention mechanism is proposed, called the complementary attention mechanism (complemented SE, CSE), which extracts more of the target's effective features. Method A new complemented attention mechanism, CSE, is proposed based on the SE attention mechanism. It consists of three steps. 1) The SE attention mechanism is used to extract the most salient discriminative features of the target and to obtain the suppressed features. 2) The SE attention mechanism is applied again to the suppressed features to extract secondary salient features. 3) The two kinds of features are fused to obtain a more efficient feature representation.
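The three steps above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `SEGate` and `CSESketch`, the reduction ratio of 16, the use of `1 - w` (the complement of the SE weights) to form the suppressed features, and fusion by channel concatenation are all assumptions that the abstract does not specify.

```python
import torch
import torch.nn as nn


class SEGate(nn.Module):
    """Squeeze-and-excitation gate: returns per-channel weights in [0, 1]."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        squeezed = x.mean(dim=(2, 3))               # squeeze: global average pool, (N, C)
        return self.fc(squeezed)[:, :, None, None]  # excitation weights, (N, C, 1, 1)


class CSESketch(nn.Module):
    """Hypothetical complemented-SE block following the three steps in the text."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.primary_gate = SEGate(channels, reduction)
        self.secondary_gate = SEGate(channels, reduction)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.primary_gate(x)
        primary = x * w              # step 1: most salient features
        suppressed = x * (1.0 - w)   # step 1: suppressed features (assumed complement)
        # Step 2: apply SE again, this time to the suppressed features.
        secondary = suppressed * self.secondary_gate(suppressed)
        # Step 3: fuse both feature sets (concatenation is an assumption).
        return torch.cat([primary, secondary], dim=1)


if __name__ == "__main__":
    features = torch.randn(2, 64, 14, 14)
    fused = CSESketch(channels=64)(features)
    print(fused.shape)  # channel count doubles because of concatenation
```

Concatenation keeps the primary and secondary features separate for later layers; element-wise addition would be an equally plausible fusion choice and would leave the channel count unchanged.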
Moreover, a cross-layer network structure extracts the salient features of different layers and fuses them into the final feature representation, so that all information about the object is mined. In the experiments, the model is implemented in PyTorch with ResNet50 (pretrained on ImageNet) as the convolutional neural network (CNN) backbone. Input images are resized to 448×448 pixels for both training and testing. The model is trained using SGD (stochastic gradient descent) with a momentum of 0.9, a weight decay of 0.0005, and a learning rate of 0.001. Training runs for 150 epochs, and the learning rate is decayed by a factor of 0.1 every 30 epochs. Result To verify the effectiveness of the method, experiments are conducted on four fine-grained datasets: CUB-Birds, Stanford Dogs, Stanford Cars, and FGVC-Aircraft. The classification accuracy reaches 87.9%, 89.1%, 93.9%, and 92.4%, respectively. The results show that the method performs comparably to state-of-the-art methods. In the ablation study, the feature extraction capability of three attention mechanisms (SE, OSME, and CSE) is compared under the same conditions. On the CUB-Birds dataset, CSE improves accuracy by 1.1% and 0.6% over SE and OSME, respectively; on the Stanford Dogs dataset, it improves accuracy by 1.7% and 1.0%, respectively. Feature visualization is also conducted to show more intuitively the regions that each attention mechanism attends to. All results show that the CSE attention mechanism has a stronger feature extraction ability than the SE and OSME attention mechanisms. The validity of each structure in the network is also verified on the CUB-Birds dataset.
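The training settings stated in the Method part above imply a simple step-decay schedule, which the following sketch makes explicit. The constant and function names are hypothetical, and the code assumes 0-based epoch indexing with the first decay taking effect at epoch 30; the abstract does not state these details.

```python
# Hyperparameters as stated in the abstract.
BASE_LR = 0.001        # initial learning rate
MOMENTUM = 0.9         # SGD momentum
WEIGHT_DECAY = 0.0005  # L2 weight decay
EPOCHS = 150           # total training epochs
DECAY_FACTOR = 0.1     # multiplicative learning-rate decay
DECAY_EVERY = 30       # epochs between decays


def learning_rate(epoch: int) -> float:
    """Learning rate in effect during the given 0-based epoch (assumed indexing)."""
    return BASE_LR * DECAY_FACTOR ** (epoch // DECAY_EVERY)


# One learning rate per epoch for the whole run.
schedule = [learning_rate(epoch) for epoch in range(EPOCHS)]

if __name__ == "__main__":
    for epoch in range(0, EPOCHS, DECAY_EVERY):
        print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.0e}")
```

In PyTorch, the same settings would typically be passed to `torch.optim.SGD` (`lr`, `momentum`, `weight_decay`) together with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`.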
Conclusion To address insufficient feature extraction in traditional attention mechanisms, a complemented attention method for fine-grained image classification is proposed. It focuses on improving the ability of the attention mechanism to extract features and on obtaining an efficient representation of target features. In the ablation study, the CSE attention mechanism attends more to discriminative regional features than the SE and OSME attention mechanisms do.
Keywords
