Video person re-identification combining BiLSTM and an attention mechanism

Yu Chenyang, Wen Linfeng, Yang Gang, Wang Yutao (College of Information Science and Engineering, Northeastern University, Shenyang 110819, China)

Abstract
Objective Video person re-identification across cameras and scenes is an important task in computer vision. In real-world scenarios, illumination changes, occlusion, viewpoint changes, and cluttered backgrounds cause drastic variations in pedestrian appearance, which makes re-identification more difficult. To improve the robustness of video person re-identification systems in complex application scenarios, we propose a video person re-identification algorithm that combines a bidirectional long short-term memory network (BiLSTM) with an attention mechanism.

Method First, a convolutional neural network (CNN) based on a residual architecture is trained to learn spatial appearance features. A BiLSTM then extracts bidirectional temporal motion information. Finally, an attention mechanism fuses the learned spatial appearance features and temporal motion information into a discriminative video-level representation (a sketch of this pipeline follows the abstract).

Result The method is compared experimentally with existing approaches on two public large-scale datasets. On iLIDS-VID, the rank-1 matching rate improves by 4.5% over the second-best method; on PRID2011, it improves by 3.9%. Ablation experiments on both datasets further verify the effectiveness of the proposed algorithm.

Conclusion The proposed video person re-identification algorithm combining BiLSTM and an attention mechanism makes full use of the information in video sequences and learns more robust sequence features. Experimental results show that it significantly improves recognition performance on both datasets.
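To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture described above: ResNet-50 frame features, a BiLSTM over time, and dot-attention pooling. The hidden size, the use of the temporal mean as the attention query, and the softmax scaling factor are our assumptions, not details stated in the abstract.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoReIDNet(nn.Module):
    """Hypothetical pipeline: CNN frame features -> BiLSTM -> dot attention."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        resnet = models.resnet50(pretrained=True)                 # ImageNet-pretrained backbone
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # drop the fc layer
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, clips):                  # clips: (B, T, 3, 256, 128)
        B, T = clips.shape[:2]
        f = self.cnn(clips.flatten(0, 1))      # (B*T, 2048, 1, 1) frame features
        f = f.flatten(1).view(B, T, -1)        # (B, T, 2048)
        seq, _ = self.bilstm(f)                # (B, T, 2*hidden) temporal features
        # Dot attention: correlate a sequence-level query (the temporal mean,
        # an assumption) with every frame, then rebuild the video feature as
        # a weighted sum of the frames.
        query = seq.mean(dim=1, keepdim=True)                        # (B, 1, 2*hidden)
        scores = query @ seq.transpose(1, 2) / seq.size(-1) ** 0.5   # (B, 1, T)
        weights = torch.softmax(scores, dim=-1)
        return (weights @ seq).squeeze(1)                            # (B, 2*hidden)
```

The clip shape (B, T, 3, 256, 128) matches the 256 × 128 input size quoted in the extended abstract below.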
Video person reidentification based on BiLSTM and attention mechanism

Yu Chenyang, Wen Linfeng, Yang Gang, Wang Yutao (College of Information Science and Engineering, Northeastern University, Shenyang 110819, China)

Abstract
Objective Video person re-identification (re-ID) has attracted much attention owing to rapidly growing surveillance camera networks and the increasing demand for public safety. In recent years, person re-ID has become one of the core problems in intelligent surveillance and multimedia applications. The task aims to match image sequences of pedestrians across non-overlapping cameras distributed at different physical locations: given a tracklet from one camera, re-ID matches the same person among the tracklets of interest in another view. In practice, video re-ID faces several challenges. The image quality of video frames tends to be rather low, and pedestrians exhibit a large range of pose variations because video acquisition is less constrained. Pedestrians in videos are usually moving, resulting in serious out-of-focus frames, motion blur, and scale variations. Moreover, the same person may look different in different videos: when people move between cameras, the large appearance changes caused by environmental and geometric variations increase the difficulty of the re-ID task. Many methods have been proposed to deal with these issues. A typical video-based person re-ID system first extracts frame-wise features with a deep convolutional neural network (CNN). The extracted features are fed into a recurrent neural network (RNN) to capture temporal structure information. Finally, average or maximum temporal pooling is applied to the RNN outputs to aggregate the features. However, average pooling considers only the generic features of a pedestrian sequence and neglects the specific features of individual samples, whereas maximum pooling concentrates on local salient features and may discard useful information. Therefore, a video person re-ID algorithm based on bidirectional long short-term memory (BiLSTM) and an attention mechanism is proposed to make full use of temporal information and improve the robustness of person re-ID systems in complex surveillance scenes.

Method The proposed algorithm breaks a long input video sequence into short snippets and randomly selects a constant number of frames from each snippet (see the sampling sketch after this paragraph). The snippets are fed into a pretrained CNN to extract the feature representation of each frame, so the network learns a spatial appearance representation. A sequence representation containing temporal motion information is then computed by a BiLSTM along the temporal dimension. The BiLSTM lets information flow both forward and backward in time, so the underlying temporal interactions are fully exploited. After feature extraction, the frame-level and sequence-level features of the probe and gallery videos are independently fed into a dot attention module. After calculating the correlation (the attention weight) between the sequence and its frames, the output sequence representation is reconstructed as a weighted sum of the frames at different spatial and temporal positions in the input sequence. With the attention mechanism, the network alleviates sample noise and poor alignment in videos. Our network is implemented in PyTorch and trained on an NVIDIA GTX 1080 GPU. All training and testing images are rescaled to a fixed size of 256 × 128 pixels. A ResNet-50 pretrained on ImageNet serves as the backbone network.
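The snippet sampling step at the start of the method could be read as in the sketch below; the snippet length and the number of frames kept per snippet are placeholder values, since the abstract only states that a constant number of frames is drawn from each snippet.

```python
import random

def sample_snippets(tracklet, snippet_len=8, frames_per_snippet=4):
    """Split a long tracklet (a list of frames) into consecutive snippets and
    keep a fixed number of randomly chosen frames per snippet, preserving
    temporal order within each snippet."""
    snippets = []
    for start in range(0, len(tracklet), snippet_len):
        chunk = tracklet[start:start + snippet_len]
        k = min(frames_per_snippet, len(chunk))
        keep = sorted(random.sample(range(len(chunk)), k))
        snippets.append([chunk[i] for i in keep])
    return snippets
```

For example, sample_snippets(list(range(20))) returns three snippets of four frame indices each, drawn from frames 0-7, 8-15, and 16-19.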
For network parameter training, we adopt stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate is initially set to 0.001 and divided by 10 after every 20 epochs. The batch size is set to 8, and training lasts 40 epochs in total (see the configuration sketch at the end of this abstract). The whole network is trained end-to-end in a joint identification and verification manner. During testing, query and gallery videos are encoded into feature vectors by the trained network. To compare the re-identification performance of the proposed method with existing advanced methods, we report the cumulative matching characteristics (CMC) at rank-1, rank-5, rank-10, and rank-20 on all datasets.

Result The proposed network is evaluated on two public benchmark datasets, iLIDS-VID and PRID2011. For iLIDS-VID, the 600 video sequences of 300 persons are randomly split into 50% of persons for training and 50% for testing. For PRID2011, we follow the experimental setup of previous methods and use only the 400 video sequences of the first 200 persons, who appear in both camera views. The experiments on both datasets are repeated 10 times with different train/test splits, and the results are averaged to ensure a stable evaluation. Rank-1 accuracy (the proportion of queries whose correct match is ranked first) reaches 80.5% on iLIDS-VID and 87.6% on PRID2011. On iLIDS-VID, rank-1 is increased by 4.5% compared with the second-best method; on PRID2011, it is increased by 3.9%. Extensive ablation studies verify the effectiveness of the BiLSTM and the attention mechanism: compared with a variant that uses only a unidirectional LSTM, rank-1 increases by 10.9% on iLIDS-VID and 12.7% on PRID2011.

Conclusion This work proposes a video person re-ID method based on BiLSTM and an attention mechanism. The proposed algorithm effectively learns spatio-temporal features relevant to the re-ID task. The BiLSTM allows temporal information to propagate not only from front to back but also in the reverse direction, and the attention mechanism adaptively selects discriminative information from the sequentially varying features. The proposed network significantly improves the recognition rate, shows improved robustness in complex scenes, outperforms several state-of-the-art approaches, and has practical application value.
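For reference, the optimization schedule quoted above (SGD with momentum 0.9, learning rate 0.001 divided by 10 every 20 epochs, batch size 8, 40 epochs) and the CMC evaluation map onto PyTorch as in the sketch below. The stand-in linear model, the dummy batches, and the plain cross-entropy identification loss are placeholders for the real network and the joint identification/verification objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_ids = 150                                  # e.g., the training identities of one iLIDS-VID split
model = nn.Linear(1024, num_ids)               # stand-in for the full network plus ID classifier
criterion = nn.CrossEntropyLoss()              # identification branch only (placeholder)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(40):                        # 40 training epochs in total
    for _ in range(10):                        # stand-in for the real training batches
        feats = torch.randn(8, 1024)           # dummy video features, batch size 8
        labels = torch.randint(0, num_ids, (8,))
        loss = criterion(model(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                           # divides the learning rate by 10 every 20 epochs

def cmc_scores(dist, query_ids, gallery_ids, ranks=(1, 5, 10, 20)):
    """CMC from a (num_query, num_gallery) distance matrix: the fraction of
    queries whose correct identity appears within the top-k gallery matches."""
    order = dist.argsort(dim=1)                # nearest gallery entries first
    match = gallery_ids[order] == query_ids.unsqueeze(1)
    return {k: match[:, :k].any(dim=1).float().mean().item() for k in ranks}
```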
