Fine-grained food image recognition based on a multi-level convolutional feature pyramid
Abstract
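Objective Food images have variable structure, strong background interference, small inter-class differences, and large intra-class differences, which makes them harder to recognize than ordinary fine-grained images. Food image recognition and classification still suffer from low accuracy and poor generalization. To improve accuracy and make full use of the global and local detail information in food images, this paper proposes a fine-grained food image recognition model based on a multi-level convolutional feature pyramid. Method The model extracts features level by level from the whole image to local regions, discards background information that causes strong interference, and extracts features only from the food target region. It consists of three main parts: a food feature extraction network, an attention region localization network, and a feature fusion network. A cascade of three food feature extraction networks transfers features from global to local. In addition, because the scale of food images varies widely, a feature pyramid structure is added to each level of the food feature extraction network, which improves robustness to target size. Result Experiments on the mainstream public food image datasets Food-101, ChineseFoodNet, and Food-172 yielded Top-1 accuracies of 91.4%, 82.8%, and 90.3%, respectively, an improvement of 1% to 8% over existing methods. Conclusion This paper proposes a multi-level convolutional neural network model for food image recognition that automatically locates the highly discriminative regions of a food image and fuses global and local features to achieve fine-grained recognition, effectively improving recognition accuracy. Experimental results show that the model achieves the best results on current mainstream food image datasets.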
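For readers unfamiliar with the metric, Top-1 accuracy is the fraction of test images whose highest-scoring class matches the ground-truth label. The sketch below illustrates the computation in plain Python; the function name `top1_accuracy` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the fraction of samples whose arg-max class equals the label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and of equal length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score in this row (first index wins on ties).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical class scores for 4 images over 3 classes (illustrative only).
example_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
example_labels = [0, 1, 2, 1]
print(top1_accuracy(example_scores, example_labels))
```

In this toy example three of the four predictions match their labels.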
Keywords
Fine-grained food image recognition of a multi-level convolution feature pyramid
Liang Huagang, Wen Xiaoqian, Liang Dandan, Li Huaide, Ru Feng (School of Electronics and Control Engineering, Chang'an University, Xi'an 710064, China)
Abstract
Objective Food images have distinctive characteristics: variable appearance, complex backgrounds, inter-class similarity, and intra-class variation. They are therefore more difficult to identify than ordinary fine-grained images. Traditional food image recognition mainly relies on hand-crafted features, including color, histogram of oriented gradients (HOG), and local binary patterns (LBP), and then applies a classifier such as a support vector machine (SVM) to those features. However, hand-crafted features cannot capture the relationships among different features. Several integrated-feature methods merely stack multiple features, so recognition accuracy on each food image dataset reaches 70% at most. Compared with hand-crafted features, which have weak expressive capability, deep learning offers stronger feature representation. Deep learning methods train multi-level convolutional neural network models on large-scale labeled food images to improve recognition accuracy. However, current methods that classify food images with convolutional neural networks feed the food image directly into the network to extract features. Food images contain relatively complicated background information, which strongly affects the recognition result. To improve the accuracy of food image recognition and take full advantage of local details, we developed a model called the multi-level convolution feature pyramid for fine-grained food image recognition. Method We extracted features from the whole image to local regions, which avoids the shortcomings of baseline methods while retaining both global information and local details. We extracted features only from the target areas of the food image and discarded background information that causes strong interference.
The multi-level convolution feature pyramid model consists of three main parts: the food feature extraction network, the attention localization network, and the feature fusion network. A single-level feature extraction network cannot obtain the global and local features of a food image simultaneously, so we cascaded three levels of food feature extraction networks to transfer features from global to local. Moreover, a feature pyramid network was constructed between the feature maps of each food feature extraction network to handle the large variation in food image scale. To locate the fine-grained area automatically, an attention localization network was placed between the levels of the feature extraction network, narrowing the feature extraction range from global to local. The fine-grained area of the original image was then cropped, enlarged, and fed to the next-level feature extraction network. The features extracted at each level were subsequently sent to the feature fusion network, so the merged features included both the global features of the food image and the detailed features of the food target. Two loss functions were used to optimize the model. The feature extraction and feature fusion networks used the SoftMax loss function, referred to as the classification loss function. The attention localization network used the inter-stage loss function. Result We adopted step-by-step and alternating training methods to train the feature extraction and attention localization networks and the cascade model separately. We conducted experiments on current mainstream food image datasets. Our model obtained Top-1 accuracies of 91.4%, 82.8%, and 90.3% on the Food-101, ChineseFoodNet, and Food-172 datasets, respectively.
The implemented framework outperformed the baselines for food image recognition, with an improvement of 1% to 8% in recognition accuracy. Moreover, to verify the performance of our model more fully, we trained it on the Food-202 dataset, which we constructed ourselves. Food-202 is a food image dataset of 202 classes covering Chinese and Western food, with more than 1,000 images in each class. Results show that adding the feature pyramid network increased the accuracy of the model by 2.4%. Conclusion We built a fine-grained food image recognition model with a multi-level feature pyramid convolutional neural network. The model automatically locates the highly discriminative areas of food images and integrates their global and local features to achieve fine-grained recognition. It effectively improves the accuracy of food recognition and the robustness to target size. Experimental results show that the proposed model performs better than the baseline models on current mainstream food image datasets.
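The crop-and-enlarge cascade described in the Method part can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the attention regions are passed in as fixed, hypothetical `(cx, cy, half)` squares instead of being predicted by a localization network, `toy_features` is a placeholder for a convolutional feature extractor, nearest-neighbour resizing stands in for whatever interpolation the paper uses, and fusion by concatenation is an assumption.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]


def crop_square(image: Image, cx: int, cy: int, half: int) -> Image:
    """Crop a square centred at (cx, cy), clamped to the image border."""
    height, width = len(image), len(image[0])
    top, bottom = max(cy - half, 0), min(cy + half, height)
    left, right = max(cx - half, 0), min(cx + half, width)
    return [row[left:right] for row in image[top:bottom]]


def enlarge_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Resize to (out_h, out_w) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def toy_features(image: Image) -> List[float]:
    """Placeholder feature extractor (assumption): mean and max intensity."""
    flat = [value for row in image for value in row]
    return [sum(flat) / len(flat), max(flat)]


def cascade_features(image: Image, regions: Sequence[Tuple[int, int, int]]) -> List[float]:
    """Extract features at each level and fuse them by concatenation."""
    height, width = len(image), len(image[0])
    current = image
    fused = toy_features(current)  # level 1: global view
    for cx, cy, half in regions:  # each region yields one finer level
        current = enlarge_nearest(crop_square(current, cx, cy, half), height, width)
        fused.extend(toy_features(current))
    return fused


# Hypothetical 8x8 "image" whose pixel value encodes its position.
demo_image = [[y * 8 + x for x in range(8)] for y in range(8)]
demo_fused = cascade_features(demo_image, [(4, 4, 2), (4, 4, 2)])
print(demo_fused)
```

With two regions the sketch produces three levels of features, mirroring the three-level cascade: each level sees a smaller part of the original image at the same resolution as the input.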
Keywords
food picture recognition; convolutional neural network; attention network; fine-grained recognition; feature pyramid