Person re-identification based on top-view depth head and shoulder sequences
Abstract
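Objective Person re-identification is the task of matching a person across images or videos captured by one or more cameras, and it is widely used in image retrieval, intelligent security, and related fields. According to camera type and viewing angle, person re-identification algorithms fall mainly into two categories: those based on side-view color cameras and those based on top-view depth cameras. In side-view color camera scenarios, most of the appearance information of a person's body is visible, whereas in top-view depth camera scenarios, only the structural information of the head and shoulders is visible. Most existing algorithms target side-view color camera scenarios, and only a few can be applied directly to top-view depth camera scenarios, especially low-resolution ones such as videos captured by bus-mounted time of flight (TOF) cameras. For top-view depth camera scenarios, this paper therefore proposes a person re-identification algorithm based on top-view depth head and shoulder sequences, aiming to improve re-identification accuracy in low-resolution scenarios. Method Head region detection and Kalman filter tracking are applied to the top-view depth head and shoulder sequence to obtain each person's head image sequence. A head depth energy map group (HeDEMaG) is constructed from this sequence, and the depth feature, area feature, projection features, Fourier descriptor, and histogram of oriented gradient (HOG) feature are extracted from it. The similarity of each feature is computed between the HeDEMaGs of different persons, and the per-feature similarities are fused by weighting them with coefficients obtained through model learning to yield an overall similarity score. The person label with the maximum similarity is taken as the recognition result. Result The algorithm is tested on the public indoor single-person dataset TVPR (top view person re-identification), the self-built indoor multi-person dataset TDPI-L (top-view depth based person identification for laboratory scenarios), and the real-world bus dataset TDPI-B (top-view depth based person identification for bus scenarios). Five measures are used to evaluate performance: rank-1 matching rate, rank-5 matching rate, macro-F1, the cumulative match characteristic (CMC) curve, and average running time. Rank-1, rank-5, and macro-F1 reach above 61%, 68%, and 67%, respectively, an improvement of at least 11% over typical algorithms. Conclusion This paper constructs a head depth energy map group that expresses the structural and behavioral characteristics of a person, achieving a multi-feature representation suited to low-resolution persons. It also proposes similarity fusion based on weight learning, which improves recognition accuracy. The algorithm performs well on the indoor single-person, indoor multi-person, and real-world bus datasets.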
Keywords
Person re-identification based on top-view depth head and shoulder sequence
Wang Xinnian1, Liu Chunhua1, Qi Guoqing1, Zhang Shiqiang1,2
(1. Dalian Maritime University, Dalian 116026, China; 2. Hualuzhida Technology Co., Ltd., Dalian 116023, China)
Abstract
Objective Person re-identification is an important task in video surveillance systems whose goal is to establish the correspondence among images or videos of a person taken by different cameras at different times. According to camera type, person re-identification algorithms can be divided into RGB camera-based and depth camera-based ones. RGB camera-based algorithms generally rely on the appearance characteristics of clothes, such as color and texture, and their performance is strongly affected by external conditions such as illumination variations. In contrast, depth camera-based algorithms are only minimally affected by lighting conditions. According to the camera viewing angle, person re-identification algorithms can also be divided into side view-oriented and vertical view-oriented algorithms. Most body parts can be seen in side-view scenarios, whereas only the plan view of the head and shoulders can be seen in vertical-view scenarios. Most existing algorithms are designed for side-view RGB scenarios, and only a few of them can be directly applied to top-view depth scenarios. Existing algorithms perform poorly, for example, with bus-mounted low-resolution depth cameras. This work focuses on person re-identification from depth head and shoulder sequences. Method The proposed person re-identification algorithm consists of four modules: head region detection, head depth energy map group (HeDEMaG) construction, HeDEMaG-based multifeature representation and similarity computation, and learning-based score-level fusion and person re-identification. First, the head region detection module detects each head region in every frame. The pixel value in a depth image represents the distance between an object and the camera plane. The range over which a person's height is distributed is used to roughly segment the candidate head regions.
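As a rough illustration of the height-range segmentation, the following Python sketch marks candidate head pixels. It assumes that a per-pixel floor distance is already available and that depths are given in millimetres; the function name `candidate_head_mask` and the thresholds `h_min` and `h_max` are hypothetical and are not taken from the paper.

```python
def candidate_head_mask(depth, floor, h_min, h_max):
    """Mark pixels whose height above the floor lies in [h_min, h_max].

    depth and floor are equally sized 2-D lists of camera-plane distances.
    Height is taken as floor distance minus object distance (an assumption
    about sign convention for a downward-looking camera).
    """
    return [
        [h_min <= f - d <= h_max for d, f in zip(depth_row, floor_row)]
        for depth_row, floor_row in zip(depth, floor)
    ]


# Hypothetical 2x2 frame: the floor is 2500 mm from the camera everywhere.
frame = [[2500, 900], [1000, 2450]]
floor_map = [[2500, 2500], [2500, 2500]]
mask = candidate_head_mask(frame, floor_map, h_min=1400, h_max=2000)
```

The resulting mask is only a coarse candidate set; it would still need a shape test to discard regions that are not heads.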
A frame-averaging model is proposed to compute the distance between the floor and the camera plane, which is needed to determine the height of each person with respect to the floor. A person's height is then obtained from the difference between the floor values and the raw frame. Because the shape of a real head region is close to a circle, the circularity ratio of each region is used to remove nonhead regions from the candidates. Second, the HeDEMaG construction module describes the structural and behavioral characteristics of a walking person's head. A Kalman filter and the Hungarian matching method are used to track the heads of multiple persons across frames. During walking, the head direction may change over time, so a principal component analysis (PCA)-based method is used to normalize the direction of each person's head regions. Each person's normalized head image sequence is uniformly divided into Rt groups in time order to capture the structural and behavioral characteristics of the head over both local and overall time periods. The average map of each group is called a head depth energy map, and the set of head depth energy maps is named HeDEMaG. Third, the HeDEMaG-based multifeature representation and similarity computation module extracts features and computes the similarity between the probe and the gallery set. The depth, area, projection maps in two directions, Fourier descriptor, and histogram of oriented gradient (HOG) feature of each head depth energy map in HeDEMaG are used to represent a person. The similarity on depth is defined as the ratio of the depth difference to the maximum difference between the probe and the gallery set. The similarity on area is defined as the ratio of the area difference to the maximum difference between the probe and the gallery set. The similarities on projections, Fourier descriptor, and HOG are computed from their correlation coefficients.
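The grouping and correlation steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the head images are already tracked, direction-normalized, and equally sized; "uniformly divided" is read as contiguous chunks of near-equal length; and the projection maps are read as row and column sums. The function names `build_hedemag`, `projections`, and `correlation` are hypothetical.

```python
from math import sqrt


def build_hedemag(frames, rt):
    """Split a time-ordered list of 2-D head images into rt contiguous
    groups and return the pixel-wise mean image of each group."""
    n = len(frames)
    if rt < 1 or n < rt:
        raise ValueError("need at least one frame per group")
    rows, cols = len(frames[0]), len(frames[0][0])
    maps = []
    for k in range(rt):
        group = frames[k * n // rt:(k + 1) * n // rt]
        maps.append([
            [sum(f[r][c] for f in group) / len(group) for c in range(cols)]
            for r in range(rows)
        ])
    return maps


def projections(energy_map):
    """Row-wise and column-wise sums of one head depth energy map."""
    row_proj = [sum(row) for row in energy_map]
    col_proj = [sum(col) for col in zip(*energy_map)]
    return row_proj, col_proj


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / sqrt(vx * vy)
```

A projection similarity between a probe and a gallery entry could then be obtained by calling `correlation` on the corresponding projection vectors of their energy maps.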
Fourth, the learning-based similarity score-level fusion and person re-identification module identifies persons according to a similarity score defined as a weighted combination of the five similarity values described above. The fusing weights are learned from the training set by minimizing a cost function that measures the recognition error rate. In the experiments, the label of the top-ranked image in the ranked list is used as the predicted label of the probe. Result Experiments are conducted on the public top view person re-identification (TVPR) dataset and two self-built datasets to verify the effectiveness of the proposed algorithm. TVPR consists of videos recorded indoors using a vertical RGB-D camera, and only one person's walking behavior is recorded. We establish two datasets, namely, top-view depth based person identification for laboratory scenarios (TDPI-L) and top-view depth based person identification for bus scenarios (TDPI-B), to verify the performance on multiple persons and in real-world scenarios. TDPI-L is composed of videos captured indoors by depth cameras, and the walking of more than two persons is recorded in each frame. TDPI-B consists of sequences recorded by bus-mounted low-resolution time of flight (TOF) cameras. Five measures, namely, rank-1, rank-5, macro-F1, cumulative match characteristic (CMC), and average time, are used to evaluate the proposed algorithm. The rank-1, rank-5, and macro-F1 of the proposed algorithm are above 61%, 68%, and 67%, respectively, which are at least 11% higher than those of the state-of-the-art algorithms. Ablation studies and the effects of tracking algorithms and parameters on performance are also discussed. Conclusion The proposed algorithm identifies persons in head and shoulder sequences captured by depth cameras from top views. HeDEMaG is proposed to represent the structural and behavioral characteristics of persons.
A learning-based method for computing the fusing weights is proposed to avoid parameter fine-tuning and to improve recognition accuracy. Experimental results show that the proposed algorithm outperforms the state-of-the-art algorithms on publicly available indoor videos and on real-world low-resolution bus-mounted videos.
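The score-level fusion and weight learning can be illustrated with the Python sketch below. The abstract does not state which optimizer is used, so a coarse grid search over non-negative weights that sum to 1 is swapped in here purely for illustration. The names `fuse`, `identify`, `error_rate`, and `learn_weights`, and the layout of the training data, are assumptions.

```python
from itertools import product


def fuse(similarities, weights):
    """Weighted sum of per-feature similarity values."""
    return sum(w * s for w, s in zip(weights, similarities))


def identify(probe_sims, weights):
    """probe_sims maps each gallery label to its list of per-feature
    similarities; return the label with the highest fused score."""
    return max(probe_sims, key=lambda label: fuse(probe_sims[label], weights))


def error_rate(training_set, weights):
    """Fraction of training probes whose top-ranked label is wrong.
    training_set is a list of (true_label, probe_sims) pairs."""
    wrong = sum(identify(sims, weights) != truth for truth, sims in training_set)
    return wrong / len(training_set)


def learn_weights(training_set, n_features, steps=4):
    """Grid search for the weights with the lowest training error.
    This is a stand-in optimizer, not necessarily the paper's method."""
    best_w, best_err = None, float("inf")
    for combo in product(range(steps + 1), repeat=n_features):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        w = [c / steps for c in combo]
        err = error_rate(training_set, w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Hypothetical two-feature training set with two gallery identities.
training = [
    ("A", {"A": [0.9, 0.2], "B": [0.5, 0.8]}),
    ("B", {"A": [0.4, 0.9], "B": [0.8, 0.1]}),
]
weights, err = learn_weights(training, n_features=2)
```

In this toy data only the first feature is discriminative, so the search settles on weights that favor it. With five features and `steps=4` the grid has 3125 candidates, which is still cheap to enumerate.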
Keywords
depth camera; top view depth head and shoulder sequence; head depth energy map group (HeDEMaG); similarity fusion weights learning; person re-identification