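Fine-grained image classification based on YOLOv3 and bilinear feature fusion
Yan Zixu1, Hou Zhiqiang1, Xiong Lei2, Liu Xiaoyi1, Yu Wangsheng3, Ma Sugang1 (1. College of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121; 2. College of Telecommunications, Xi'an Jiaotong University, Xi'an 710049; 3. College of Information and Navigation, Air Force Engineering University, Xi'an 710077)
Abstract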
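Objective Fine-grained image classification is a challenging topic in computer vision. Its goal is to divide a broad category into more detailed subcategories, and it is in wide demand in both industry and academia. To mitigate interference from irrelevant backgrounds and the difficulty of extracting features that distinguish between categories, this paper proposes an optimized fine-grained classification algorithm that combines the object detection method YOLOv3 (you only look once) with a bilinear fusion network, thereby improving fine-grained image classification performance. Method A retrained YOLOv3 object detector first coarsely locates the target in the image. A background suppression method then removes interfering information outside the target. Finally, the classic fine-grained classification algorithm, the bilinear convolutional neural network (B-CNN), is improved by fusing convolutional features from different channels and different levels; fusing the feature vectors of different convolutional layers in the bilinear network yields richer complementary information and thus higher fine-grained classification accuracy. Result Experiments show that on the CUB-200-2011 (Caltech-UCSD Birds-200-2011), Cars196, and Aircrafts100 datasets, the proposed algorithm achieves classification accuracies of 86.3%, 92.8%, and 89.0%, respectively, which are 2.2%, 1.5%, and 4.9% higher than those of the classic B-CNN algorithm, verifying its effectiveness. The algorithm also shows certain advantages over existing fine-grained image classification algorithms. Conclusion The improved algorithm uses YOLOv3 to effectively filter out a large amount of irrelevant background and improves the bilinear convolutional neural classification network through feature fusion, enriching the feature information and making the classification results more accurate.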
Keywords
Fine-grained classification based on bilinear feature fusion and YOLOv3
Yan Zixu1, Hou Zhiqiang1, Xiong Lei2, Liu Xiaoyi1, Yu Wangsheng3, Ma Sugang1 (1. College of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; 2. College of Telecommunications, Xi'an Jiaotong University, Xi'an 710049, China; 3. College of Information and Navigation, Air Force Engineering University, Xi'an 710077, China)
Abstract
Objective Image classification is a classic topic in computer vision and can be divided into coarse-grained and fine-grained classification. Coarse-grained classification identifies objects of different categories, whereas fine-grained image classification subdivides a larger category into finer subcategories, which in many cases has greater practical value. Fine-grained image classification is a challenging research topic with extensive research needs and application scenarios in both industry and academia. Problems remain, however, because of background interference and the difficulty of extracting effective classification features. Compared with general image classification, fine-grained classification is more affected by background interference, a problem that object detection methods can address. The task of object detection is to find the objects of interest in an image and determine their position and size. At present, a growing number of object detection methods are based on deep learning, and they fall into two categories: one-stage and two-stage detection methods. One-stage methods are fast but slightly less accurate; examples include you only look once (YOLO) and the single shot multibox detector (SSD). Two-stage methods first use region proposals to generate candidate targets and then use a convolutional neural network (CNN) to process these candidates; examples include R-CNN (region CNN), SPP-Net (spatial pyramid pooling convolutional network), and Faster R-CNN. Among them, YOLOv3 of the YOLO series achieves a better balance between detection accuracy and speed than other commonly used object detection frameworks.
Method To improve the accuracy of fine-grained classification, this study proposes a fine-grained classification algorithm based on the fusion of YOLOv3 and bilinear features. The algorithm first uses the retrained object detection algorithm YOLOv3 to coarsely locate the target. Then, a background suppression method removes irrelevant background interference. Finally, a feature fusion method is applied to the classic fine-grained classification algorithm, the bilinear convolutional neural network (B-CNN, bilinear CNN), to improve it. Merging the features of different convolutional layers yields more abundant complementary information, which improves classification accuracy. The specific steps are as follows: 1) input the image; 2) use the retrained YOLOv3 model to generate discriminative regions; 3) apply the background suppression method to remove irrelevant background interference outside the detection box; 4) construct a bilinear fine-grained classification network with feature fusion, and use deep convolutional neural networks to extract features from the image at multiple convolution stages; 5) use the outer product operation to fuse the features of the convolutional layers at different stages, and concatenate the resulting fused features from the three levels to obtain the final bilinear vector. Finally, a Softmax layer performs the fine-grained classification. Result After adding the YOLOv3 algorithm with background suppression, the classification accuracy on the three datasets is 0.7%, 0.5%, and 3.1% higher than that of B-CNN, respectively, indicating that removing background interference with YOLOv3 can effectively improve classification.
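The background suppression step (step 3 of the Method) can be illustrated with a minimal sketch. The abstract does not state how pixels outside the detection box are treated, so this example assumes they are simply set to a constant value; the function name `suppress_background`, the `(x1, y1, x2, y2)` box format, and the zero fill are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def suppress_background(image, box, fill=0):
    """Keep the pixels inside `box` and overwrite everything else with `fill`.

    image: array of shape (H, W, C)
    box:   (x1, y1, x2, y2) in pixel coordinates, as a detector might return
           (assumed format, not taken from the paper)
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    # Clip the box to the image bounds so slicing stays valid.
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    out = np.full_like(image, fill)
    out[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return out

# Toy 4x5 RGB image with non-zero pixel values and a hypothetical detection box.
image = (np.arange(4 * 5 * 3, dtype=np.uint8) + 1).reshape(4, 5, 3)
masked = suppress_background(image, (1, 1, 4, 3))
```

The output keeps the original image size, so it can be fed to the classification network without changing the input pipeline; cropping to the box would be an alternative design with different trade-offs.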
After feature fusion is used to optimize the B-CNN network structure, performance is tested on three datasets: CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Cars, and fine-grained visual categorization (FGVC) Aircraft. The results are 1.4% and 1.2% higher than those of B-CNN, which indicates that fusing the features of different convolutional layers and strengthening the spatial relationships among features can effectively improve classification accuracy. With both YOLOv3 background suppression and the feature-fusion B-CNN, the accuracy reaches 86.3%, 92.8%, and 89.0% on the three datasets, respectively, which is 2.2%, 1.5%, and 4.9% higher than that of the B-CNN algorithm, indicating the effectiveness of the proposed algorithm. The improved algorithm also shows certain advantages over mainstream fine-grained classification algorithms. Conclusion The fine-grained classification algorithm based on YOLOv3 and bilinear feature fusion proposed in this study uses YOLOv3 to filter out a large amount of irrelevant background and obtain discriminative regions in the image, and it improves the bilinear fine-grained classification network by means of feature fusion, so as to extract richer fine-grained features and make fine-grained image classification more accurate. The improved feature-fusion B-CNN learns richer features, which improves fine-grained classification accuracy to a certain extent. On the three fine-grained datasets, the proposed algorithm outperforms the classic B-CNN algorithm and some mainstream algorithms.
On the other hand, new fine-grained classification algorithms continue to emerge. They use a variety of deep learning models for fine-grained classification but do not use background suppression and feature fusion to extract richer fine-grained features. In the future, we will apply feature fusion to these newer networks and explore different types of fusion to further improve fine-grained classification accuracy.
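The outer-product fusion in step 5 of the Method can be sketched as follows. This is a minimal NumPy illustration, not the authors' network: it assumes three feature maps that share the same spatial size, pairs them in a hypothetical way (the abstract does not say which layers are paired), and applies the signed square root and L2 normalization used in the original B-CNN. The names `bilinear_pool` and `fused_bilinear_vector` are made up for this example.

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Bilinear (outer-product) pooling of two feature maps of shape (C, H, W)."""
    ca, h, w = fa.shape
    cb = fb.shape[0]
    a = fa.reshape(ca, h * w)
    b = fb.reshape(cb, h * w)
    # a @ b.T sums the outer products of the two descriptors over all locations.
    phi = (a @ b.T / (h * w)).reshape(-1)
    # Signed square root followed by L2 normalization, as in the original B-CNN.
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    return phi / (np.linalg.norm(phi) + 1e-12)

def fused_bilinear_vector(f1, f2, f3):
    """Concatenate the bilinear features of three layer pairs (assumed pairing)."""
    pairs = [(f1, f2), (f2, f3), (f1, f3)]  # hypothetical choice of pairs
    return np.concatenate([bilinear_pool(x, y) for x, y in pairs])

# Random stand-ins for three convolutional feature maps with 8 channels each.
rng = np.random.default_rng(0)
f1, f2, f3 = (rng.standard_normal((8, 7, 7)) for _ in range(3))
vec = fused_bilinear_vector(f1, f2, f3)  # shape (192,): three 8*8 blocks
```

In a full model, a vector like `vec` would be passed to a fully connected layer followed by Softmax. With realistic channel counts (hundreds per layer) each block has tens of thousands of dimensions, so the choice and number of pairs directly affects model size.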
Keywords
fine-grained image classification; target detection; background suppression; feature fusion; bilinear convolutional neural network (B-CNN)