Attention-guided local feature joint learning for facial expression recognition

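Lu Lidan1,2, Xia Haiying1, Tan Yumei3, Song Shuxiang1 (1. Guangxi Key Laboratory of Brain-inspired Computing and Intelligent Chips, School of Electronic and Information Engineering, Guangxi Normal University, Guilin 541004; 2. College of Big Data and Artificial Intelligence, Nanning College of Technology, Nanning 530105; 3. School of Computer Science and Engineering, Guangxi Normal University, Guilin 541004)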

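Abstract
Objective In complex natural scenes, facial expression recognition is affected by local occlusions such as glasses, hand movements, and hairstyles, and these occluded regions reduce the model's ability to discriminate emotions. Therefore, this paper proposes an attention-guided local feature joint learning method for facial expression recognition. Method The method consists of a global feature extraction module, a global feature enhancement module, and a local feature joint learning module. The global feature extraction module extracts middle-layer global features. The global feature enhancement module suppresses the redundant features introduced by the face recognition pretrained model and enhances the semantic information of the feature maps most relevant to emotion in the global face image. The local feature joint learning module uses a mixed-attention mechanism to learn fine-grained salient features of different local face regions and constrains them with a joint loss. Result Experiments were conducted on two in-the-wild datasets, RAF-DB (real-world affective faces database) and FERPlus. On RAF-DB, the recognition accuracy is 89.24%, an improvement of 0.84% over MA-Net (global multi-scale and local attention network). On FERPlus, the recognition accuracy is 90.04%, comparable to that of FER-VT (FER framework with two attention mechanisms). The experimental results show that the method has good robustness. Conclusion By learning in the order of global enhancement followed by local refinement, the proposed method effectively reduces the interference of local occlusion.
Keywords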
Attention-guided local feature joint learning for facial expression recognition

Lu Lidan1,2, Xia Haiying1, Tan Yumei3, Song Shuxiang1 (1. Guangxi Key Laboratory of Brain-inspired Computing and Intelligent Chips, School of Electronic and Information Engineering, Guangxi Normal University, Guilin 541004, China; 2. College of Big Data and Artificial Intelligence, Nanning College of Technology, Nanning 530105, China; 3. School of Computer Science and Engineering, Guangxi Normal University, Guilin 541004, China)

Abstract
Objective When communicating face to face, people convey their inner emotions in various ways, such as conversational tone, body movements, and facial expressions. Among these, facial expression is the most direct cue for observing human emotions. People convey their thoughts and feelings through facial expressions and also use them to recognize others' attitudes and inner states. Facial expression recognition is therefore one of the research directions in the field of affective computing. It can be applied to many fields, such as fatigue driving detection, human–computer interaction, analysis of students' listening state, and intelligent medical services. However, in complex natural situations, facial expression recognition suffers from direct occlusion by masks, sunglasses, gestures, hairstyles, or beards, as well as indirect occlusion caused by varying lighting, complex backgrounds, and pose variation. These factors make it difficult to extract discriminative features in natural scenes, so the final recognition results are poor. We therefore propose an attention-guided joint learning method for local features in facial expression recognition to reduce the interference of occlusion and pose variation. Method Our method is composed of a global feature extraction module, a global feature enhancement module, and a joint learning module for local features. First, we use ResNet-50 as the backbone network and initialize its parameters with a model pretrained on the MS-Celeb-1M face recognition dataset. The rich information available in the face recognition model can complement the contextual information needed for facial expression recognition, especially the middle-layer features such as eyes, nose, and mouth.
Thus, the global feature extraction module, which consists of a 2D convolutional layer and three bottleneck residual convolutional blocks, is used to extract the middle-layer global features. Second, most facial expression features are concentrated in localized key regions such as eyes, nose, and mouth. Accordingly, expression categories can often be recognized correctly from local key information alone, without the overall face information. Given that face recognition requires overall facial information, the face recognition pretrained model introduces some features that are unimportant for expression recognition. Therefore, we use a global feature enhancement module to suppress the redundant features (e.g., features in the nose region) brought by the face recognition pretrained model and to enhance the semantic information of the global face image that is most relevant to emotion. This module is implemented with the efficient channel attention (ECA) mechanism, which, through cross-channel interactions between high-level semantic channel features, strengthens the channel features that contribute to classification and weakens those that are detrimental to it. Finally, we uniformly divide the output features of the global feature enhancement module into four non-overlapping local regions along the spatial dimensions. This division places the eye and mouth regions of most face images in the four sub-image blocks, so the global facial expression analysis problem is split into multiple local regions for computation. The fine-grained salient features of the different local face regions are then learned through the mixed-attention mechanism. The local feature joint learning module learns information from complementary contexts, which reduces the negative effects of occlusion and pose variation.
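As a rough illustration of two steps described above, the following pure-Python sketch applies an ECA-style channel gate and then divides a feature map into four non-overlapping spatial blocks. It is a sketch under stated assumptions, not the authors' implementation: the function names, the fixed kernel size of 3, and the uniform placeholder kernel are all hypothetical, and a trained network would learn the kernel values and operate on tensors rather than nested lists.

```python
import math


def eca_weights(feature_map, kernel_size=3):
    """Return one gate value in (0, 1) per channel, ECA-style.

    feature_map is a nested list shaped C x H x W.
    The uniform kernel is a placeholder assumption; in a trained
    network the 1D kernel values are learned.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    # Global average pooling: one scalar descriptor per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    kernel = [1.0 / kernel_size] * kernel_size  # placeholder, not learned
    pad = kernel_size // 2
    padded = [0.0] * pad + pooled + [0.0] * pad  # zero padding at both ends
    weights = []
    for c in range(len(pooled)):
        # 1D convolution over neighbouring channels (cross-channel interaction).
        z = sum(k * padded[c + i] for i, k in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid gate
    return weights


def apply_channel_weights(feature_map, weights):
    """Scale every value of each channel by that channel's gate value."""
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(feature_map, weights)]


def split_into_quadrants(feature_map):
    """Split C x H x W into four non-overlapping spatial blocks.

    Order: top-left, top-right, bottom-left, bottom-right.
    For odd H or W, the bottom or right blocks take the extra row or column.
    """
    h, w = len(feature_map[0]), len(feature_map[0][0])
    hh, hw = h // 2, w // 2

    def crop(r0, r1, c0, c1):
        return [[row[c0:c1] for row in ch[r0:r1]] for ch in feature_map]

    return [crop(0, hh, 0, hw), crop(0, hh, hw, w),
            crop(hh, h, 0, hw), crop(hh, h, hw, w)]
```

A typical call would chain the three functions: compute `eca_weights`, pass the result to `apply_channel_weights`, and hand the gated map to `split_into_quadrants`, so that each of the four blocks can be processed by its own local branch.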
Because our method integrates four classifiers for local feature learning, a decision-level fusion strategy is used for the final prediction: the output probabilities of the four classifiers are summed, and the category with the maximum summed probability is the model prediction. Result Experiments were performed on two in-the-wild expression datasets, namely, the real-world affective faces database (RAF-DB) and face expression recognition plus (FERPlus). The ablation experiments show gains over the base model of 1.89% and 2.47% on the two datasets, respectively. On RAF-DB, the recognition accuracy is 89.24%, an improvement of 0.84% over the global multi-scale and local attention network (MA-Net). On FERPlus, the recognition accuracy is 90.04%, comparable to that of the FER framework with two attention mechanisms (FER-VT). These results indicate that our method has good robustness. We also test the model trained on RAF-DB on the FED-RO dataset, which contains real occlusions, and achieve an accuracy of 67.60%. We further use Grad-CAM++ to visualize the attention heatmaps of the proposed model to demonstrate its effectiveness more intuitively. The visualization of the joint learning module for local features shows that the module directs the overall model to focus on the features in each local image block that are useful for classification. Conclusion The proposed method is guided by the attention mechanism: it enhances the global features first and then learns the salient features in local regions. This learning sequence of global enhancement followed by local refinement effectively reduces the interference of local occlusion.
Experiments on two natural scene datasets and the occlusion test set show that the model is simple, effective, and robust.
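The decision-level fusion described in the Method part (sum the output probabilities of the four classifiers, then take the category with the maximum value) can be sketched in a few lines of Python. This is a minimal illustration only: the function name `fuse_predictions` and the three-class probability values are hypothetical and do not come from the paper.

```python
def fuse_predictions(branch_probs):
    """Sum per-class probabilities over all branches and pick the arg max.

    branch_probs holds one probability list per local classifier,
    all of the same length (one entry per expression category).
    Returns (predicted class index, summed scores).
    """
    num_classes = len(branch_probs[0])
    summed = [sum(p[c] for p in branch_probs) for c in range(num_classes)]
    predicted = max(range(num_classes), key=summed.__getitem__)
    return predicted, summed


# Hypothetical outputs of four local classifiers over three categories.
example_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
    [0.4, 0.4, 0.2],
]
predicted_class, summed_scores = fuse_predictions(example_probs)
```

In this made-up example the first branch alone would choose category 0, but the summed scores favor category 1, which is how a branch that sees an occluded region can be outvoted by the others.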
Keywords
