Research progress in deep facial expression recognition

Li Shan, Deng Weihong (School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China)

Abstract
As the facial expression recognition task gradually shifts from laboratory-controlled environments to challenging real-world environments, and with the rapid development of deep learning techniques, deep neural networks can learn discriminative features and are increasingly applied to automatic facial expression recognition. Current deep facial expression recognition systems focus on solving two problems: 1) overfitting caused by the lack of sufficient training data; and 2) interference from expression-unrelated variables in real-world environments, such as illumination, head pose, and identity. This paper first summarizes the state of research on deep facial expression recognition methods over the past decade and the development of related facial expression databases. Current deep-learning-based facial expression recognition methods are then divided into two categories, static and dynamic facial expression recognition, and each category is introduced and reviewed. The performance of state-of-the-art deep expression recognition algorithms on common expression databases is compared, and the strengths and weaknesses of each class of algorithms are analyzed in detail. Finally, the paper summarizes future research directions, opportunities, and challenges in this field: considering that an expression is essentially the dynamic activity of facial muscle movements, deep expression recognition networks based on dynamic sequences often achieve better recognition results than static networks. In addition, combining other expression models, such as facial action unit models, with other multimedia modalities, such as audio and human physiological signals, can extend expression recognition to scenarios of greater practical value.
Deep facial expression recognition: a survey

Li Shan, Deng Weihong (School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China)

Abstract
Facial expression is a powerful, natural, and universal signal for human beings to convey their emotional states and intentions. Numerous studies have been conducted on automatic facial expression analysis because of its practical importance in sociable robotics, medical treatment, driver fatigue surveillance, and many other human-computer interaction systems. Various facial expression recognition (FER) systems have been explored to encode expression information from facial representations in the fields of computer vision and machine learning. Traditional methods typically use handcrafted features or shallow learning for FER. Since 2013, however, related studies have collected training samples from challenging real-world scenarios, implicitly promoting the transition of FER from laboratory-controlled to in-the-wild settings. Meanwhile, studies in various fields have increasingly used deep learning methods, which achieve state-of-the-art recognition accuracy and remarkably exceed the results of previous investigations owing to considerably improved chip processing abilities (e.g., GPUs) and appropriately designed network architectures. Moreover, deep learning techniques are increasingly utilized to handle challenging factors for emotion recognition in the wild because they can be trained effectively on facial expression data. This transition of FER from laboratory-controlled to challenging in-the-wild conditions, together with the recent success of deep learning in various fields, has promoted the use of deep neural networks to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on the following important issues. 1) Deep neural networks require a large amount of training data to avoid overfitting. However, existing facial expression databases are insufficient for training the common deep network architectures that achieve promising results in object recognition tasks.
2) Expression-unrelated variations, such as illumination, head pose, and identity bias, are common in unconstrained facial expression scenarios. These disturbances are nonlinearly confounded with facial expressions and therefore heighten the need for deep networks that address the large intraclass variability and learn effective expression-specific representations. In this survey, we provide a comprehensive review of deep FER, including datasets and algorithms that provide insights into these intrinsic problems. First, we introduce the background of FER and summarize the development of datasets widely used in the literature as well as FER algorithms of the past 10 years. Second, we divide FER systems into two main categories according to feature representations: static image FER and dynamic sequence FER. The feature representation in static-based methods is encoded with only spatial information from the current single image, whereas dynamic-based methods consider temporal relations among contiguous frames in the input facial expression sequence. On the basis of these two vision-based methods, other modalities, such as audio and physiological channels, have also been used in multimodal sentiment analysis systems to assist FER. Although pure expression recognition based on visible face images can achieve promising results, incorporating it with other modalities into a high-level framework can provide complementary information and further enhance robustness. We introduce existing novel deep neural networks and related training strategies designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations in state-of-the-art deep FER. Competitive performance and experimental comparisons of these deep FER systems on widely used benchmarks are also summarized.
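The distinction between the two categories above can be illustrated with a minimal numpy sketch. This is not any published FER architecture: the encoder and aggregator are hypothetical stand-ins (a real system would use a CNN backbone and, e.g., a recurrent or 3D-convolutional temporal model), but it shows how a static method encodes spatial information from one image while a dynamic method pools information across contiguous frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def static_features(frame):
    """Toy stand-in for a CNN encoder: uses spatial information
    from a single image only (global average pooling per channel)."""
    return frame.mean(axis=(0, 1))

def dynamic_features(frames):
    """Toy stand-in for a temporal model: aggregates per-frame
    features so contiguous frames contribute to one representation."""
    per_frame = np.stack([static_features(f) for f in frames])
    return per_frame.mean(axis=0)  # temporal average pooling

sequence = rng.random((16, 48, 48, 3))  # 16 frames of a 48x48 "face" clip
s = static_features(sequence[0])        # static FER: current single image
d = dynamic_features(sequence)          # dynamic FER: whole sequence
```

Both paths yield a fixed-length feature vector that a classifier head could consume; the dynamic path simply has access to motion cues the static one discards.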
We then discuss the relative advantages and disadvantages of these different types of methods with respect to two open issues (data size requirements and expression-unrelated variations) and other focuses (computational efficiency, performance, and network training difficulty). Finally, we review and summarize the following challenges in this field and future directions for the design of robust deep FER systems. 1) A lack of training data in terms of both quantity and quality is a main challenge for deep FER systems. Abundant sample images with diverse head poses and occlusions, as well as precise face attribute labels including expression, age, gender, and ethnicity, are crucial for practical applications. Crowdsourcing under the guidance of expert annotators is a reasonable approach for massive annotation. 2) Data bias and inconsistent annotations are very common among facial expression datasets because of varied collection conditions and the subjectiveness of annotation. Furthermore, FER performance fails to improve when training data are enlarged by directly merging multiple datasets because of inconsistent expression annotations. Cross-database performance is thus an important criterion for evaluating the generalizability and practicability of FER systems. Deep domain adaptation and knowledge distillation are promising approaches to address this bias. 3) Another common issue is the imbalanced class distribution of facial expressions, owing to the practicalities of sample acquisition. One solution is to resample and balance the class distribution, on the basis of the number of samples for each class, during the preprocessing stage using data augmentation and synthesis. An alternative is to develop a cost-sensitive loss layer that reweights samples during network training.
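The cost-sensitive reweighting mentioned in point 3 can be sketched as follows. This is a minimal numpy illustration, not the loss layer of any specific FER system: class weights are set inversely proportional to class frequency (a common heuristic), and each sample's cross-entropy term is scaled by its class weight, so rare expressions such as "disgust" contribute more to the loss than an over-represented "neutral" class.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights: rarer expression classes get larger weights.
    Normalized so that a perfectly balanced dataset yields all-ones weights."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1))

def weighted_cross_entropy(probs, labels, weights):
    """Cost-sensitive loss: each sample's log-loss is scaled by its class weight."""
    p = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(np.mean(weights[labels] * -np.log(p)))

# Imbalanced toy batch: class 0 dominates (e.g., 'neutral' vs. rare classes).
labels = np.array([0, 0, 0, 0, 0, 0, 1, 2])
w = class_weights(labels, n_classes=3)          # rare classes weighted up
probs = np.full((8, 3), 1 / 3)                  # uniform toy predictions
loss = weighted_cross_entropy(probs, labels, w)
```

In a deep learning framework the same idea is usually expressed by passing per-class weights to the cross-entropy loss rather than computing it by hand.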
4) Although FER within the categorical model has been extensively investigated, the definition of prototypical expressions covers only a small portion of specific categories and cannot capture the full repertoire of expressive behavior in realistic interactions. Incorporating other affective models, such as FACS (facial action coding system) and dimensional models, can facilitate the recognition of facial expressions and help networks learn expression-discriminative representations. 5) Human expressive behavior in realistic applications involves encoding from different perspectives, with facial expressions being only one modality. Fusing other modalities, such as audio, infrared images, depth information from 3D face models, and physiological data, has therefore become a promising research direction, owing to their large complementarity with facial expressions and their strong value in human-computer interaction (HCI) applications.
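One simple way to combine the modalities discussed in point 5 is decision-level (late) fusion. The sketch below, with illustrative numbers and weights chosen purely for the example, averages per-modality class posteriors with modality weights; real multimodal systems may instead fuse features or learn the fusion weights end to end.

```python
import numpy as np

def late_fusion(modality_probs, weights):
    """Decision-level fusion: weighted average of per-modality class posteriors.
    Weights are normalized so the fused scores remain a valid distribution."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(modality_probs)             # (n_modalities, n_classes)
    return np.tensordot(weights, stacked, axes=1)  # fused posterior

face  = np.array([0.6, 0.3, 0.1])  # visible-face expression classifier output
audio = np.array([0.2, 0.7, 0.1])  # speech-emotion classifier output
fused = late_fusion([face, audio], weights=[0.7, 0.3])  # trust face cues more
pred  = int(np.argmax(fused))
```

Here the face and audio channels disagree on the top class, and the fusion weights decide which channel dominates, which is exactly the complementarity argument made above.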
