Progress and challenges of facial action unit detection methods
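Li Yong1,2, Zeng Jiabei1, Liu Xin3, Shan Shiguang1,2 (1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049; 3. Beijing Seetatech Technology Co., Ltd., Beijing 100080)
Abstract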
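The Facial Action Coding System defines a set of facial action units (AUs) from the perspective of facial anatomy to precisely characterize changes in facial expression. Each AU describes the appearance change produced by the movement of a group of facial muscles, and combinations of AUs can express any facial expression. AU detection is a multilabel classification problem whose challenges include insufficient annotated data, interference from head pose, individual differences, and class imbalance among AUs. To summarize recent progress in AU detection, this paper systematically reviews representative methods proposed since 2016. It divides them by input modality into methods based on static images, dynamic videos, and other modalities, and discusses the weakly supervised AU detection methods introduced under each modality to reduce data dependence. For static images, it further covers methods based on local feature learning, AU relation modeling, multitask learning, and weakly supervised learning. For dynamic videos, it mainly covers methods based on temporal features and self-supervised AU feature learning. Finally, the paper compares and summarizes the strengths and weaknesses of the representative methods and, on this basis, discusses the challenges facing facial AU detection and future trends.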
Keywords
Progress and challenges in facial action unit detection
Li Yong1,2, Zeng Jiabei1, Liu Xin3, Shan Shiguang1,2 (1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Beijing Seetatech Technology Co., Ltd., Beijing 100080, China)
Abstract
The anatomically based Facial Action Coding System defines a unique set of atomic, nonoverlapping facial muscle actions called action units (AUs), which can accurately characterize facial expressions. AUs correspond to muscular activities that produce momentary changes in facial appearance, and combinations of AUs can represent any facial expression. As a multilabel classification problem, AU detection suffers from insufficient AU annotations, varied head poses, individual differences, and class imbalance among AUs. This article systematically reviews representative methods proposed since 2016 to facilitate the further development of AU detection. According to the input modality, these methods are categorized as image-based, video-based, or based on other modalities. We also discuss how AU detection methods can cope with partial supervision by exploiting large amounts of unlabeled data. Image-based methods include approaches that learn local facial representations, exploit AU relations, and use multitask or weakly supervised learning. Handcrafted or automatically learned local facial representations can capture the local deformations caused by active AUs. However, the former cannot represent different AUs with adaptive local regions, whereas the latter suffer from insufficient training data. Approaches that exploit AU relations use the prior knowledge that some AUs tend to co-occur whereas others are mutually exclusive. Such methods adopt either Bayesian networks or graph neural networks to model AU relations that are manually inferred from the annotations of specific datasets. Because these relations are dataset-specific, such methods are inflexible and do not transfer well to cross-dataset evaluation. Multitask AU detection methods are motivated by two observations: facial shapes represented by facial landmarks are helpful for AU detection, and facial deformations caused by active AUs affect the distribution of landmark locations. In addition to detecting facial AUs, such methods typically estimate facial landmarks or recognize facial expressions in a multitask manner. Other facial emotion analysis tasks, such as emotional dimension estimation, can also be incorporated into the multitask learning setting. Video-based methods are categorized into those that rely on temporal representation learning and those that rely on self-supervised learning. Temporal representation learning methods commonly adopt long short-term memory (LSTM) networks or 3D convolutional neural networks (3D-CNNs) to model temporal information. Other temporal approaches use optical flow between frames to detect facial AUs. Several recent self-supervised approaches exploit the prior knowledge that facial actions, that is, the movements of facial muscles between frames, can serve as a self-supervisory signal. Such video-based weakly supervised AU detection methods are well motivated and explainable, and they can effectively alleviate the problem of insufficient AU annotations. However, they rely on massive amounts of unlabeled video data during training and cannot perform AU detection in an end-to-end manner. We also review methods that exploit point clouds or thermal images for AU detection, which can alleviate the influence of head pose or illumination. Finally, we compare representative methods, analyze their advantages and drawbacks, and on this basis summarize and discuss the challenges and potential directions of AU detection. We conclude that methods capable of using weakly annotated or unlabeled data are an important direction for future research. Such methods should be carefully designed according to prior knowledge of AUs to reduce the demand for large amounts of labeled data.
Keywords
facial action unit (AU); image-based AU detection; video-based AU detection; weakly supervised learning; insufficient annotations
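
The abstract frames AU detection as a multilabel classification problem with class imbalance among AUs. As a minimal illustration of that framing, the sketch below scores each AU with an independent sigmoid and weights the positive term of the binary cross-entropy by the negative-to-positive ratio of that AU. This is one common way to handle imbalance and is not taken from any specific method in the survey. The function names (`positive_weights`, `weighted_bce`) and the toy label matrix are hypothetical, chosen only for this example.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def positive_weights(labels):
    """Per-AU weight = (#negatives / #positives), so rarer AUs weigh more.

    labels: list of rows, one per face, each a list of 0/1 AU activations.
    """
    n = len(labels)
    weights = []
    for j in range(len(labels[0])):
        pos = sum(row[j] for row in labels)
        # Fall back to 1.0 for an AU that never occurs (assumption).
        weights.append((n - pos) / pos if pos > 0 else 1.0)
    return weights

def weighted_bce(logits, targets, pos_weights):
    """Mean weighted binary cross-entropy over the AUs of one face."""
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        # -log(sigmoid(z)) = softplus(-z); -log(1 - sigmoid(z)) = softplus(z)
        total += w * y * softplus(-z) + (1 - y) * softplus(z)
    return total / len(logits)

# Toy data: 4 faces, 2 AUs; the second AU is active in only one face.
labels = [[1, 0], [0, 0], [1, 1], [1, 0]]
weights = positive_weights(labels)
loss = weighted_bce([0.5, -1.0], [1, 0], weights)
```

In this toy set the second AU is active in one of four faces, so its positive examples receive three times the weight of its negatives, while the frequent first AU is down-weighted.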