Discriminative information incremental learning for air-ground multi-view action recognition

Liu Wenxuan1, Zhong Xian1, Xu Xiaoyu1, Zhou Zhuo2, Jiang Kui3, Wang Zheng2, Bai Xiang4 (1.Wuhan University of Technology; 2.Wuhan University; 3.Harbin Institute of Technology; 4.Huazhong University of Science and Technology)

Abstract
Objective With the growing demands of public safety and livelihood, action recognition in air-ground multi-view scenarios, in which ground devices are combined with aerial devices such as drones, has emerged. Existing methods consider only the view relationships that arise when the horizontal spatial view changes, ignoring the large differences in discriminative action information caused by vertical spatial changes. Because of the difference in height, the observed features of the same object differ significantly, which poses a major challenge to traditional multi-view action recognition methods when handling vertical spatial view changes. Method This paper defines the significant differences in action appearance as differences in discriminative action information and proposes DAIL (discriminative action information incremental learning), a multi-view action recognition method for air-ground scenes that distinguishes ground views from aerial views according to view height and the amount of information. Following the brain-like principle of learning "from easy to hard, step by step", it distills the discriminative action information of different views separately and increments the discriminative action information of ground-view (simple) samples into aerial-view (complex) samples, assisting the network in learning the aerial-view samples. Result Experiments were conducted on the Drone-Action and UAV datasets. Compared with the current state-of-the-art method SBP (stochastic backpropagation), accuracy improves by 18.0% and 16.2% on the two datasets, respectively; compared with a strong baseline, the proposed method reduces parameters by 2.4M and computation by 6.9G FLOPs on the UAV dataset. Conclusion The proposed method demonstrates that enhancing complex samples with simple samples significantly improves the network's feature learning ability, whereas attempting the reverse reduces accuracy. This paper discusses the multi-view action recognition task from a new perspective, combining effectiveness and efficiency; the method outperforms representative methods on common aerial-view datasets and can be extended to other multi-view tasks.
Keywords
Discriminative information incremental learning for air-ground multi-view action recognition

(1.Wuhan University of Technology; 2.Wuhan University; 3.Harbin Institute of Technology; 4.Huazhong University of Science and Technology)

Abstract
Objective With the increasing demand for public security, ground devices are combined with aerial devices, such as drones, to recognize actions in air-ground scenarios. Meanwhile, extensive ground-based camera networks and a wealth of ground surveillance data can offer reliable support to these aerial surveillance devices. How to effectively utilize the mobility of these aerial devices is a topic that warrants further research. Existing multi-view action recognition methods focus only on the difference in discriminative action information when the horizontal spatial view changes, and do not consider the difference that arises when the vertical spatial view changes. The high mobility of aerial platforms can lead to changes in the vertical spatial perspective. According to the principles of perspective, observing the same object from different heights results in a significant change in appearance. This, in turn, causes substantial differences in the appearance of the same person's actions when observed from high-altitude and ground-level perspectives. These significant variations in action appearance are referred to as differences in discriminative action information, and they pose a challenge for traditional multi-view action recognition methods in effectively addressing vertical spatial perspective changes. Method When the viewing perspective lies in the same horizontal spatial plane as the observed objects, the most comprehensive and rich discriminative action information can be observed, and networks can easily learn and comprehend it. However, when the viewing perspective is in a different horizontal spatial plane from the observed objects, an inclined perspective occurs, resulting in a significant change in action appearance.
This transition from a ground-level perspective to an aerial perspective leads to insufficiently observed information and a reduction in discriminative action information. When networks attempt to learn and understand this information, misclassifications are more likely to occur. Therefore, based on the amount of discriminative action information, ground-level perspective information can be considered simple information that is easily learned and understood, while aerial perspective information can be seen as complex information that is challenging to learn and understand. In fact, the human brain follows a "progressive" learning process when dealing with various types of information: it prioritizes the processing of simple information and uses the learned simple information to assist in learning complex information. In the task of vertical spatial multi-view action recognition, differences in perspectives and environmental influences lead to varying amounts of discriminative action information observed at different heights. In this paper, we adopt a brain-like approach. We rank samples from the aerial perspective by the amount of discriminative action information they contain: complex samples contain less discriminative action information and are challenging for networks to learn and understand, while simple samples contain more and are easier to learn and comprehend. We then distill discriminative action information separately from simple and complex samples. Within the same action category, despite differences in the amount of discriminative action information between simple and complex samples, the represented action categories should share commonalities. Therefore, using the discriminative action information incremental learning method, we incrementally inject the rich discriminative action information learned from simple samples into the feature information of complex samples.
This addresses the issue of complex samples carrying insufficient discriminative action information, allowing complex samples to learn more discriminative action information with the assistance of simple samples and making it easier for networks to learn and understand them. This paper proposes discriminative action information incremental learning (DAIL) for multi-view action recognition in complex air-ground scenes, which distinguishes the ground view from the air view based on view height and the amount of information. It adopts the brain-like principle of "ordered incremental progression" to distill discriminative action information for different views separately. Discriminative action information is incremented from the ground-view (simple) samples into the air-view (complex) samples to assist the network in learning and understanding the air-view samples. Result The method is experimentally validated on two datasets, Drone-Action and UAV. Compared with the current state-of-the-art method SBP (stochastic backpropagation), accuracy on the two datasets is improved by 18.0% and 16.2%, respectively. Compared with the strong baseline method, our method reduces the parameters by 2.4M and the FLOPs by 6.9G on the UAV dataset. To validate the effectiveness of the proposed method in scenarios involving both ground-level and aerial perspectives, we introduce two datasets: N-UCLA (comprising samples exclusively from ground-based cameras with rich discriminative action information) and Drone-Action (comprising a mix of ground-level and aerial samples, where the aerial samples contain relatively limited discriminative action information). A joint analysis of discriminative action information ranking was conducted on these datasets. Our findings indicate that enhancing complex samples using simpler ones significantly improves the network's feature learning capacity; conversely, attempting the reverse can lead to a reduction in accuracy.
This observation aligns with the way the human brain processes information, embodying the concept of "progressive" learning. Conclusion The proposed method demonstrates that enhancing complex samples with simple ones significantly improves the network's feature learning ability, whereas the reverse operation reduces accuracy. This paper discusses the multi-view action recognition task from a new perspective, achieving both effectiveness and efficiency; the method outperforms representative approaches on common aerial-view datasets and can be extended to other multi-view tasks.
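The core idea described above (ranking aerial samples by how much discriminative information they carry, then incrementally injecting ground-view information into same-class aerial features) can be sketched as follows. This is a minimal illustration, not the paper's formulation: the variance-based information score, the function names, and the mixing weight `alpha` are all assumptions introduced here for clarity.

```python
import numpy as np

def rank_by_information(features):
    """Rank samples from most to least 'informative'.

    Per-sample feature variance stands in for the amount of
    discriminative action information (an assumption; the paper's
    actual measure may differ). Returns indices, simplest first.
    """
    scores = features.var(axis=1)
    return np.argsort(-scores)

def incremental_injection(air_feat, ground_feat, alpha=0.5):
    """Inject ground-view (simple) information into an air-view
    (complex) feature of the same action class.

    A convex combination moves the complex feature toward the
    information-rich simple feature by a step of size alpha.
    """
    return (1.0 - alpha) * air_feat + alpha * ground_feat

# Toy features for one action class observed from two heights.
ground = np.array([1.0, 0.0, 2.0, 0.0])   # ground view: rich information
air = np.array([0.2, 0.1, 0.3, 0.1])      # aerial view: attenuated information
fused = incremental_injection(air, ground, alpha=0.5)
# fused lies strictly closer to the ground-view feature than air did
```

Because the injection is a convex combination, the distance to the ground-view feature shrinks by exactly a factor of `1 - alpha`, which is one simple way to see why the "simple assists complex" direction adds information while the reverse direction would dilute it.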
Keywords
