A survey of deep learning-based human-object interaction detection

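Liao Yue1, Li Zhimin2, Liu Si1 (1. Institute of Artificial Intelligence, Beihang University, Beijing 100191; 2. School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074)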

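Abstract
Human-object interaction (HOI) detection aims to understand and analyze human behavior by precisely localizing the humans who perform specific actions in an image or video, together with the objects they interact with, and by recognizing the action relationship between each human and object. It is a research direction of considerable practical value and forward-looking significance, and a key cornerstone of high-level visual understanding. With the development of deep learning, deep learning-based methods have driven recent progress in HOI detection. On the one hand, this paper analyzes the spatial-domain HOI detection task, summarizing and analyzing current datasets and benchmarks from two perspectives: data content and scene, and annotation granularity. It then systematically describes the state of current detection methods along two lines, two-stage pipeline methods and one-stage end-to-end methods, analyzes the characteristics, strengths, and weaknesses of each, and clarifies how methods in the field have developed. Two-stage methods comprise two main paradigms, multi-stream models and graph models, while one-stage models comprise box-based, interaction-point-based, and query-based paradigms. On the other hand, the paper summarizes the spatio-temporal HOI detection task, analyzing the construction and characteristics of existing spatio-temporal interaction datasets and the strengths and weaknesses of existing baseline algorithms. Finally, it discusses future research directions.
Keywords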
A review of deep learning based human-object interaction detection

Liao Yue1, Li Zhimin2, Liu Si1(1.Institute of Artificial Intelligence, Beihang University, Beijing 100191, China;2.School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China)

Abstract
Human-object interaction (HOI) detection is essential for intelligent analysis of human behavior. It targets fine-grained analysis of human behavior in images or videos by localizing interacting human-object pairs and recognizing their interaction types. HOI detection supports high-level visual applications such as dangerous-behavior detection and human-robot interaction. Recent deep learning-based methods have driven progress in HOI detection, and this paper reviews them critically. We first cover image-level HOI detection, beginning with datasets, since dataset growth is a key factor in the progress of deep learning. The datasets and benchmarks for image-level HOI detection are introduced according to annotation granularity, which places the conventional datasets at three levels: instance, part, and pixel. For each dataset at each level, we describe its image collection, annotation, and statistics. Next, we analyze HOI detection methods according to their deep learning architecture, dividing them into two main categories: two-stage methods with a serial architecture and one-stage methods with an end-to-end framework. Two-stage methods consist of two separate serial stages: an instance detector first detects humans and objects, and a purpose-built interaction classifier then infers the interaction type between each detected human and object. To obtain an accurate interaction classifier, two-stage methods mostly focus on the second stage. One-stage methods, by contrast, merge the stages into a single framework in which HOI triplets are detected directly in an end-to-end manner. One-stage methods can also be regarded as a top-down paradigm: an anchor is designed to denote the interaction, and it is detected first and then associated with the human and the object. We retrace the representative methods and the development paths of the two categories, and analyze their strengths, weaknesses, and potential. We introduce the two-stage methods first, dividing them into multi-stream pipelines and graph-based pipelines according to the design of the second stage. We then divide the one-stage methods into point-based, bounding-box-based, and query-based methods according to how the interaction anchor is defined. After that, we review progress in zero-shot HOI detection. We also review the development of video-level HOI detection in terms of datasets and methods. Finally, we identify three future directions for HOI detection: 1) HOI detection guided by large-scale pre-trained models: the variety of human behaviors produces so many complex HOI types that annotating all of them is impractical, so zero-shot HOI discovery remains a challenging problem; 2) self-supervised pre-training for HOI detection: from a mechanism point of view, a large-scale image-text pre-trained model is expected to benefit HOI understanding; and 3) efficient video HOI detection: conventional multi-stage detection pipelines struggle to detect video-based HOIs efficiently. In summary, this paper systematically reviews deep learning-based HOI detection.
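
As a rough illustration of the two-stage pipeline described in the abstract, the Python sketch below pairs every detected human with every other detected instance and passes each pair to an interaction classifier. It is a minimal sketch under stated assumptions, not the method of any surveyed paper: the names `Detection`, `HOITriplet`, `two_stage_hoi`, and `toy_pair_classifier` are hypothetical, the first-stage detections are assumed to be given, the overlap-based classifier is a stand-in for a learned second stage, and multiplying the three scores into a triplet score is an assumed convention.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


@dataclass(frozen=True)
class Detection:
    """One output of the first-stage instance detector (hypothetical type)."""
    label: str
    box: Box
    score: float


@dataclass(frozen=True)
class HOITriplet:
    """A <human, verb, object> triplet with a combined confidence."""
    human: Detection
    verb: str
    obj: Detection
    score: float


PairClassifier = Callable[[Detection, Detection], Dict[str, float]]


def two_stage_hoi(detections: List[Detection],
                  classify_pair: PairClassifier,
                  threshold: float = 0.5) -> List[HOITriplet]:
    """Second stage: score every human-object pair from the first stage."""
    humans = [d for d in detections if d.label == "person"]
    triplets = []
    for human, obj in product(humans, detections):
        if obj is human:  # a human is not paired with itself
            continue
        for verb, verb_score in classify_pair(human, obj).items():
            # Assumed scoring rule: product of detector and verb scores.
            score = human.score * obj.score * verb_score
            if score >= threshold:
                triplets.append(HOITriplet(human, verb, obj, score))
    return sorted(triplets, key=lambda t: t.score, reverse=True)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def toy_pair_classifier(human: Detection, obj: Detection) -> Dict[str, float]:
    """Stand-in for a learned interaction classifier (illustration only)."""
    if obj.label == "bicycle" and boxes_overlap(human.box, obj.box):
        return {"ride": 0.9}
    return {}


if __name__ == "__main__":
    dets = [
        Detection("person", (10, 10, 50, 100), 0.9),
        Detection("bicycle", (20, 60, 80, 120), 0.8),
        Detection("dog", (200, 200, 250, 250), 0.95),
    ]
    for t in two_stage_hoi(dets, toy_pair_classifier):
        print(t.human.label, t.verb, t.obj.label, round(t.score, 3))
```

The pairing loop grows with the number of humans times the number of instances, which is one reason the abstract contrasts this serial design with one-stage methods that detect triplets directly.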
Keywords
