Multi-resolution feature attention fusion method for person re-identification

Shen Qing, Tian Chang, Wang Jiabao, Jiao Shanshan, Du Lin (Army Engineering University of PLA, Nanjing 210007, China)

Abstract
Objective Person re-identification is the key technology for recognizing the same person across cameras. It faces challenges such as appearance, illumination, pose, and background variations, and the core of distinguishing individuals lies in representing both the global and the local features of a person. To represent persons efficiently, a multi-resolution feature attention fusion method for person re-identification is proposed. Method With the help of an attention mechanism and based on the HRNet (high-resolution network) backbone, four different branches are constructed through interleaved convolutions to extract multi-resolution person image features. Features of different granularities are extracted and the features of different branches interact with one another, yielding an efficient feature representation of the person. Result The effectiveness of the proposed method is verified on three datasets, Market1501, CUHK03, and DukeMTMC-ReID, where rank-1 reaches 95.3%, 72.8%, and 90.5% and mAP (mean average precision) reaches 89.2%, 70.4%, and 81.5%, respectively. On the Market1501 and DukeMTMC-ReID datasets, the results surpass the previous best performance. Conclusion The proposed method focuses on improving the feature extraction ability of the network to obtain a strong feature representation. It can be applied to computer vision tasks that rely on feature extraction, such as person re-identification, image classification, and object detection, and it significantly improves the accuracy of person re-identification.
Multi-resolution feature attention fusion method for person re-identification

Shen Qing, Tian Chang, Wang Jiabao, Jiao Shanshan, Du Lin(Army Engineering University of PLA, Nanjing 210007, China)

Abstract
Objective Person re-identification (ReID) is the computer vision task of re-identifying a queried person across non-overlapping surveillance camera views deployed at different locations by matching images of the person. As a fundamental problem in intelligent surveillance analysis, person ReID has attracted increasing interest in recent years in the computer vision and pattern recognition research communities. Although great progress has been made, person ReID still faces challenges such as occlusion, illumination, pose variance, and background clutter. The key to addressing these difficulties is to design a convolutional neural network (CNN) architecture that can extract discriminative feature representations. Specifically, the architecture should compact the intra-class variation (among images of the same individual) and separate the inter-class variation (between different individuals). The person ReID pipeline mainly includes two stages, namely, feature extraction and distance measurement. Most contemporary studies focus on feature extraction because a good feature can effectively distinguish different persons. Thus, the designed CNN must represent both the global and the local features of different individuals well. To fully mine the information contained in an image, we fuse the features of the same image at different resolutions to obtain a stronger feature representation and develop a multi-resolution feature attention fusion method for person ReID.

Method At present, mainstream person ReID methods are based on classical networks such as ResNet and VGG(visual geometry group)-Net. The main characteristic of these networks is that the resolution of the feature maps decreases as the network deepens, so their high-level features contain rich semantic information but lack spatial information. For person ReID, however, the spatial information of an individual is essential. The high-resolution network (HRNet) is a multi-branch network that maintains high-resolution representations throughout the whole process. HRNet is constructed with interleaved convolutions, which help obtain features of different granularities and enable information exchange among the branches, and it outputs four feature representations at different resolutions. In this study, we first evaluate the performance of each resolution's feature representation and find that it is not consistent across datasets. We therefore propose an attention module that fuses the four representations: it generates four weights that add up to 1, each representation is updated according to its weight, and the final feature representation is the accumulation of the four updated features.
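To make the fusion concrete, the following is a minimal PyTorch sketch of the weighted fusion described above, not the authors' released implementation; the branch widths (32/64/128/256, as in HRNet-W32), the embedding size, and names such as `MultiResolutionAttentionFusion` are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionAttentionFusion(nn.Module):
    """Re-weights and accumulates the four HRNet branch outputs."""

    def __init__(self, branch_dims=(32, 64, 128, 256), embed_dim=256):
        super().__init__()
        # Project each globally pooled branch feature to a common size
        # so the re-weighted features can be accumulated by addition.
        self.projections = nn.ModuleList(
            nn.Linear(d, embed_dim) for d in branch_dims
        )
        # One attention logit per branch, predicted from all branches jointly.
        self.attention = nn.Linear(embed_dim * len(branch_dims), len(branch_dims))

    def forward(self, branch_maps):
        # branch_maps: four tensors of shape (N, C_i, H_i, W_i), one per resolution.
        pooled = [F.adaptive_avg_pool2d(x, 1).flatten(1) for x in branch_maps]
        embedded = [proj(p) for proj, p in zip(self.projections, pooled)]
        logits = self.attention(torch.cat(embedded, dim=1))
        # Four weights that add up to 1, as described in the abstract.
        weights = torch.softmax(logits, dim=1)  # (N, 4)
        # Update each branch feature by its weight and accumulate.
        fused = sum(w.unsqueeze(1) * e
                    for w, e in zip(weights.unbind(dim=1), embedded))
        return fused  # (N, embed_dim)

# Example with dummy HRNet-like outputs for a batch of 2 images:
maps = [torch.randn(2, c, h, w) for c, h, w in
        [(32, 64, 32), (64, 32, 16), (128, 16, 8), (256, 8, 4)]]
print(MultiResolutionAttentionFusion()(maps).shape)  # torch.Size([2, 256])
```

The sketch pools each branch map globally and projects it to a common size so that the four weighted features can be accumulated by simple addition.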
Result Experiments are conducted on three ReID datasets: Market1501, CUHK03, and DukeMTMC-ReID. Rank-1 accuracies of 95.6%, 72.8%, and 90.5% and mAP (mean average precision) scores of 89.2%, 70.4%, and 81.5% are obtained on the Market1501, CUHK03, and DukeMTMC-ReID datasets, respectively. Our method achieves state-of-the-art results on the DukeMTMC-ReID dataset and performs competitively with the state-of-the-art methods on the Market1501 and CUHK03 datasets; its mAP score is also the highest on Market1501. In the ablation study, we evaluate the influence of three factors on the performance of our model: the location of the attention module, the resolution of the input images, and the normalization method used for the attention weights. Results show that placing the attention module at the back of the network outperforms placing it at the front, that the image resolution has little influence on performance, and that sigmoid normalization of the weights outperforms softmax normalization.

Conclusion In this study, we proposed a multi-resolution attention fusion method for person ReID. HRNet is used as the backbone to extract both coarse-grained and fine-grained features, which are helpful for person ReID. Through the ablation study, we found that the performance of the individual resolution feature representations is not consistent across datasets, so we proposed an attention module to fuse them: it outputs four weights representing the importance of the different resolution features, and the fused feature is obtained by accumulating the updated features. Experiments on the Market1501, CUHK03, and DukeMTMC-ReID datasets showed that our method outperforms several state-of-the-art person ReID approaches and that the attention fusion improves performance.
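The weight normalization compared in the ablation can be isolated in a short sketch (our assumption of how the two variants might be wired, not the paper's code): softmax couples the four branch weights so they sum to 1, whereas sigmoid gates each branch independently in [0, 1].

```python
import torch

def normalize_branch_weights(logits: torch.Tensor, mode: str = "sigmoid") -> torch.Tensor:
    """logits: (N, 4) raw attention scores, one per resolution branch."""
    if mode == "softmax":
        return torch.softmax(logits, dim=1)  # coupled weights summing to 1
    if mode == "sigmoid":
        return torch.sigmoid(logits)         # independent gates in [0, 1]
    raise ValueError(f"unknown normalization: {mode}")
```

One plausible reading of the ablation result is that independent sigmoid gates let several informative branches stay strong at once instead of competing for a fixed weight budget.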
