Facial expression recognition driven by deep landmark features
Abstract
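Objective Facial landmark detection and facial expression recognition are closely related tasks. Existing work that combines them simply couples the two tasks directly and ignores their intrinsic connection. To address this problem, we propose a multitask deep framework that recognizes facial expressions with the help of landmark features. Method Following the inception structure, we design a deep network that detects landmarks and recognizes facial expressions simultaneously. Supervised by both tasks, the network pays more attention to information near the landmarks, so that features around the facial organs obtain larger responses. To further reduce the influence of noise from other facial regions on expression recognition, a location attention map is generated from the detected landmarks; it further increases the weights of features around the facial organs and reduces the feature responses in the border regions of the face. Complex expressions deform parts of the face, which makes landmark detection harder. To alleviate this problem, an intermediate supervision layer is introduced: an expression recognition task with a small weight is added to the first-stage landmark detection network, which both improves landmark detection on samples with complex expressions and lets the network extract more expression-related features. Result The method is compared with classical methods on three public datasets: CK+ (Cohn-Kanade dataset), Oulu (Oulu-CASIA NIR&VIS facial expression database), and MMI (MMI facial expression database). It achieves the highest recognition accuracy on CK+, and on Oulu and MMI its accuracy exceeds that of the best previously reported method by 0.14% and 0.54%, respectively. Conclusion The experimental results demonstrate the effectiveness of introducing landmark information: the multitask convolutional neural network recognizes expressions more accurately than a traditional single-task convolutional neural network. Introducing the attention model also improves the expression recognition rate of the multitask network.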
Keywords
Facial expression recognition based on deep facial landmark features
Wang Shanmin, Shuai Hui, Liu Qingshan (School of Automation, Nanjing University of Information Science and Technology, Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)
Abstract
Objective Automatic facial expression recognition (FER) aims to design models that identify human emotions automatically from facial images. Many methods have been proposed in the past 20 years, and previous work can generally be divided into two categories: image-based methods and video-based methods. In this study, we propose a new image-based FER method guided by facial landmarks. A facial expression is ultimately a representation of facial muscle movement and consists of various facial action units (AUs) distributed among the facial organs. Meanwhile, the purpose of facial landmark detection is to localize the position and shape of the face and the facial organs. Thus, facial expression recognition and facial landmark detection are closely related. Based on this observation, some works combine facial expression recognition and facial landmark localization with different strategies; most of them extract geometric features or attend only to texture information around the landmarks to recognize the facial expression. Although these methods achieve good results, they still have limitations: they assist FER by using given facial landmarks as prior information but ignore the internal connection between the two tasks. To solve this problem, a deep multitask framework is proposed in this study. Method A multitask network is designed to recognize facial expressions and locate facial landmarks simultaneously because both tasks attend to features around the facial organs, including the eyebrows, eyes, nose, and mouth (points along the external contour are discarded). However, ground-truth facial landmarks are not easy to obtain in practice, especially for some FER benchmarks.
We first use a stacked hourglass network to detect the facial landmark points, because the stacked hourglass network achieves excellent performance in face alignment, as was also demonstrated in the 2nd Facial Landmark Localization Competition held in conjunction with CVPR (IEEE Conference on Computer Vision and Pattern Recognition) 2017. The designed network has two branches, one for each task. Considering the relationship between the two tasks, the branches share two convolution layers at the start of the network. The facial landmark localization branch is simple, comprising three convolution layers and a fully connected layer, because it only assists facial expression recognition in selecting features. The facial expression recognition branch is more complex: it introduces the inception module and applies convolution kernels of different sizes to capture multiscale features. The two tasks are optimized jointly with a unified loss to learn the network parameters, in which the commonly used distance loss and entropy loss are applied to facial landmark localization and facial expression recognition, respectively. Although features around the facial landmarks obtain strong responses under the supervision of the two tasks, noise still exists in other areas. For example, part of the collar is retained in the cropped face image, which adversely affects facial expression recognition. To deal with this issue, location attention maps are created from the landmarks obtained in the facial landmark localization branch. The proposed location attention map is a weight matrix of the same size as the corresponding feature maps, and it indicates the importance of each position. Inspired by the stacked hourglass network, a series of heat maps is first generated from Gaussian distributions, taking the coordinates of each point as the mean and selecting an appropriate variance.
Then, a max-pooling operation merges these maps into the location attention map. The generated location attention maps depend on the performance of facial landmark localization because they use the positions of the key points detected in the first branch. Thus, valid features may be filtered out when the detected landmarks deviate substantially. For small offsets, this problem can be alleviated by adjusting the variance of the Gaussian distribution, but this does not work when the predicted landmarks deviate greatly from the ground truth. To solve this problem, intermediate supervision is introduced into facial landmark localization by adding the facial expression recognition task with a small weight. The final loss consists of three parts: the intermediate supervision loss, the facial landmark localization loss in the first branch, and the facial expression recognition loss in the second branch. Result To validate the effectiveness of the proposed method, ablation studies are conducted on three popular databases: CK+ (Cohn-Kanade dataset), Oulu (Oulu-CASIA NIR&VIS facial expression database), and MMI (MMI facial expression database). We also compare the multitask network with a single-task network to evaluate the importance of introducing landmark localization into facial expression recognition. The experimental results demonstrate that the proposed multitask network outperforms the traditional convolution network, improving recognition accuracy on the three databases by 0.93%, 1.71%, and 2.92%, respectively. The results also show that the generated location attention map is effective, improving recognition accuracy by 0.14%, 2.43%, and 1.82%, respectively, on the three databases. Finally, performance on the three databases peaks when intermediate supervision is added: recognition accuracy on the Oulu and MMI databases increases by 0.14% and 0.54%, respectively.
Intermediate supervision has minimal effect on the CK+ database because its samples are simple and the predicted landmarks do not deviate significantly. Conclusion A multitask network is designed to recognize facial expressions and localize facial landmarks simultaneously, and the experimental results demonstrate that the relationship between facial expression recognition and landmark localization is useful for facial expression recognition. The proposed location attention map improves recognition accuracy and reveals that features distributed around the facial organs are powerful for facial expression recognition. Meanwhile, the introduced intermediate supervision helps improve facial landmark localization so that the generated location attention map can filter out noise accurately.
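The location attention map described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the unnormalized Gaussian (peak value 1), the default `sigma`, and the reading of the "max-pooling" merge as an element-wise maximum across the per-landmark heat maps are all assumptions made for illustration.

```python
import numpy as np


def location_attention_map(landmarks, height, width, sigma=2.0):
    """Build an (H, W) attention map from K landmark coordinates.

    landmarks: sequence of (x, y) positions in feature-map pixel units.
    sigma: assumed standard deviation of each Gaussian heat map.
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)  # shape (K, 2)
    ys, xs = np.mgrid[0:height, 0:width]                 # each (H, W)

    # Offsets from every pixel to every landmark, shape (K, H, W).
    dx = xs[None, :, :] - landmarks[:, 0, None, None]
    dy = ys[None, :, :] - landmarks[:, 1, None, None]

    # One Gaussian heat map per landmark, centred on that landmark.
    heatmaps = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

    # Merge the K heat maps by taking the element-wise maximum.
    return heatmaps.max(axis=0)


def apply_attention(features, attention):
    """Weight (C, H, W) feature maps by an (H, W) attention map."""
    return features * attention[None, :, :]


if __name__ == "__main__":
    att = location_attention_map([(2, 3), (5, 1)], height=8, width=8, sigma=1.0)
    weighted = apply_attention(np.ones((4, 8, 8)), att)
    print(att.shape, weighted.shape)
```

A larger `sigma` widens each Gaussian, which corresponds to the abstract's remark that adjusting the variance tolerates small landmark offsets; it cannot compensate for landmarks that are far from their true positions.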
Keywords
facial expression recognition (FER); facial landmark detection; multi-task; attention model; intermediate supervision
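The three-part loss mentioned in the abstract (intermediate supervision loss, landmark localization loss, and expression recognition loss) can be sketched as a weighted sum. The sketch below is a hypothetical illustration only: it assumes a mean squared Euclidean distance for the "distance loss" and a softmax cross-entropy for the "entropy loss", and the weights `mid_weight` and `landmark_weight` are placeholder values, since the abstract states only that the intermediate task has a small weight.

```python
import numpy as np


def landmark_distance_loss(pred, target):
    """Mean squared Euclidean distance between (K, 2) landmark arrays."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


def cross_entropy_loss(logits, label):
    """Softmax cross-entropy for one sample with an integer class label."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])


def total_loss(pred_landmarks, true_landmarks, mid_logits, final_logits,
               label, mid_weight=0.1, landmark_weight=1.0):
    """Sum of the three terms; both weights are assumed, not from the paper."""
    loss_landmark = landmark_distance_loss(pred_landmarks, true_landmarks)
    loss_expression = cross_entropy_loss(final_logits, label)
    # Intermediate supervision: expression loss on the landmark branch.
    loss_intermediate = cross_entropy_loss(mid_logits, label)
    return (landmark_weight * loss_landmark
            + loss_expression
            + mid_weight * loss_intermediate)


if __name__ == "__main__":
    print(total_loss([[1.0, 2.0]], [[1.5, 2.0]],
                     mid_logits=[0.2, 0.1, -0.3],
                     final_logits=[2.0, 0.1, -1.0],
                     label=0))
```

Keeping `mid_weight` small lets the landmark branch stay focused on localization while still receiving an expression-related training signal, which is the role the abstract assigns to intermediate supervision.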