Multi-branch micro-expression recognition method fusing multi-type features and multi-attention mechanisms
Shen Tianshu1, Chi Jing2, Wang Yanbing2, Lei Yanlei2, Xu Ming2 (1. Shandong University of Finance and Economics; 2. School of Computer Science and Technology, Shandong University of Finance and Economics)
Abstract
Objective: To address the problems of insufficient micro-expression samples, the difficulty of learning multi-type features, and low recognition accuracy, this paper proposes a three-branch network composed of a deep convolutional neural network and a shallow 3DCNN, into which multiple attention modules are embedded. Method: The network introduces the enhanced channel attention module ECANet into the deep convolutional neural network and proposes a new facial feature enhancement attention module for the shallow 3DCNN. The three branches extract and process facial features, optical flow features, and optical strain features respectively and fuse them layer by layer, so that the different information hidden in the features of each layer is fully exploited, the loss of subtle facial features during recognition is reduced, and the accuracy of micro-expression recognition is significantly improved. In addition, this paper introduces the IM Loss (Information Maximizing Loss) function into a micro-expression recognition model for the first time and combines it with the Focal Loss function, enabling the model to handle hard-to-classify samples more effectively and markedly improving its recognition of hard samples. Result: Experiments on the CASME II, SMIC, and SAMM micro-expression datasets and the MEGC2019 composite micro-expression dataset show that the proposed method outperforms both traditional methods and existing deep learning methods in micro-expression recognition. Conclusion: The proposed three-branch network significantly improves the accuracy of micro-expression recognition and effectively alleviates the problems of insufficient micro-expression samples and the difficulty of learning multi-type features.
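The weighted combination of Focal Loss and IM Loss described above can be pictured as follows. This is a minimal PyTorch sketch, not the paper's implementation: it assumes the standard focal loss and one common information-maximizing formulation (per-sample entropy minimization plus batch-level diversity maximization), and the weighting coefficient `lam` is hypothetical, since the abstract does not give the actual weights.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Standard focal loss: down-weights well-classified samples so
    # that training concentrates on hard examples.
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

def im_loss(logits):
    # One common information-maximizing formulation: make each
    # prediction confident (low conditional entropy) while keeping
    # the batch-level class distribution diverse (high marginal
    # entropy).
    probs = F.softmax(logits, dim=1)
    cond_ent = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    marginal = probs.mean(dim=0)
    marg_ent = -(marginal * torch.log(marginal + 1e-8)).sum()
    return cond_ent - marg_ent

def combined_loss(logits, targets, lam=0.5):
    # `lam` is a hypothetical weighting coefficient.
    return focal_loss(logits, targets) + lam * im_loss(logits)
```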
Keywords
Multi-branch micro-expression recognition method based on multi-type features and multi-attention mechanisms
Shen Tianshu1, Chi Jing2, Wang Yanbing2, Lei Yanlei2, Xu Ming2 (1. Shandong University of Finance and Economics; 2. School of Computer Science and Technology, Shandong University of Finance and Economics)
Abstract
Objective: Micro-expressions are involuntary, brief facial movements that occur when individuals experience certain emotions, often revealing their genuine feelings despite efforts to conceal them. Recognizing these expressions is valuable in a variety of real-world applications. Traditional methods for micro-expression recognition typically rely on hand-crafted features combined with classical machine learning techniques, extracting feature descriptors from the original images. However, these methods often struggle with the subtle and fast spatio-temporal changes inherent in micro-expressions, leading to suboptimal recognition accuracy. Optical flow is a common technique for micro-expression recognition, as it captures small motion changes and focuses on spatio-temporal information; it is, however, computationally expensive, requiring substantial resources for video processing. In addition, unclear boundaries between expression categories and the limited sample sizes of existing datasets challenge current deep learning models such as 3DCNNs, which may struggle with feature learning, generalization, and model fitting. To address these issues, this paper proposes a three-branch network designed to overcome insufficient micro-expression data, the difficulty of learning multi-type features, and low recognition accuracy. The proposed network integrates a deep convolutional neural network (CNN) and a shallow 3DCNN with multiple attention mechanisms.

Method: The network incorporates an enhanced channel attention module (ECANet) within the deep CNN, which allows more effective selection of key feature channels while reducing interference from irrelevant data. In addition, a facial feature enhancement attention module (FEAM) is proposed for the shallow 3DCNN, improving the model's ability to detect subtle changes in facial expression. The network uses three distinct branches to extract and process facial features, optical flow features, and optical strain features, which are progressively fused across layers. This design maximizes the information captured at each layer, minimizing the loss of fine detail and significantly improving recognition accuracy. Moreover, the paper introduces the Information Maximizing Loss (IM Loss) function into micro-expression recognition for the first time, in combination with the Focal Loss function; this weighted combination helps the model better handle challenging samples. The proposed method consists of three main stages: preprocessing, feature extraction and fusion, and micro-expression classification. In the preprocessing stage, the apex frame of each video sequence is located to identify the peak of the expression, and optical flow and optical strain maps are computed between the onset and apex frames. During feature extraction and fusion, the first branch uses ResNet-18 with ECANet to extract facial features, while the second and third branches, shallow 3DCNNs equipped with FEAM, extract features from the optical flow and optical strain maps; the features from the three branches are then fused layer by layer to enhance the feature representation. In the final classification stage, a Softmax function predicts the probability of each sample belonging to one of three micro-expression categories: Negative, Positive, or Surprise, and the class with the highest probability is taken as the predicted result.
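The preprocessing step can be illustrated concretely. The following is a minimal sketch, not the paper's code: it uses OpenCV's Farneback dense flow as a stand-in (the abstract does not name the flow algorithm) and derives the optical-strain magnitude from the symmetric gradient of the flow field.

```python
import cv2
import numpy as np

def flow_and_strain(onset_gray, apex_gray):
    # Dense optical flow between the onset and apex frames
    # (grayscale uint8 images of equal size).
    flow = cv2.calcOpticalFlowFarneback(
        onset_gray, apex_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]

    # Optical strain is the symmetric gradient of the flow field,
    # eps = 0.5 * (grad(w) + grad(w)^T); its magnitude highlights
    # subtle deformations of the facial surface.
    ux, uy = np.gradient(u, axis=1), np.gradient(u, axis=0)
    vx, vy = np.gradient(v, axis=1), np.gradient(v, axis=0)
    strain = np.sqrt(ux**2 + vy**2 + 0.5 * (uy + vx)**2)
    return flow, strain
```

Likewise, the three-branch layout can be sketched as below. This skeleton is illustrative only: the internals of ECANet and FEAM and the exact layer-by-layer fusion scheme are not given in the abstract, so plain branches with late fusion stand in for them, and all layer sizes and input shapes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ThreeBranchSketch(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        face = resnet18(weights=None)  # facial-feature branch
        face.fc = nn.Identity()        # expose the 512-d embedding
        self.face_branch = face

        def shallow_3dcnn(in_ch):
            # Shallow 3DCNN stand-in; input assumed (B, C, D, H, W)
            # with depth D >= 2.
            return nn.Sequential(
                nn.Conv3d(in_ch, 16, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
            )

        self.flow_branch = shallow_3dcnn(2)    # (u, v) flow channels
        self.strain_branch = shallow_3dcnn(1)  # strain magnitude
        self.classifier = nn.Linear(512 + 32 + 32, num_classes)

    def forward(self, face_img, flow_vol, strain_vol):
        f = self.face_branch(face_img)      # (B, 512)
        o = self.flow_branch(flow_vol)      # (B, 32)
        s = self.strain_branch(strain_vol)  # (B, 32)
        logits = self.classifier(torch.cat([f, o, s], dim=1))
        return logits  # softmax over logits gives class probabilities
```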
Result: To minimize bias during model training, this study uses Leave-One-Subject-Out (LOSO) cross-validation on the CASME II, SMIC, SAMM, and MEGC2019 micro-expression datasets: the data of one subject are held out for testing while the remaining data are used for training, ensuring that no information from the test subject is seen during training. The evaluation metrics are the Unweighted Average Recall (UAR) and the Unweighted F1-score (UF1). The proposed method achieves the highest UF1 and UAR scores on the CASME II dataset, at 0.9843 and 0.9857 respectively; on the SMIC dataset, its UF1 and UAR of 0.7671 and 0.7525 likewise outperform existing methods. Ablation studies further confirm the effectiveness and robustness of the proposed approach.

Conclusion: The proposed micro-expression recognition model reduces the loss of subtle facial features, captures key facial information while suppressing redundancy, and thereby improves recognition accuracy. The specialized loss functions additionally strengthen the model's ability to recognize difficult samples. Compared with other mainstream models, the proposed approach achieves state-of-the-art recognition performance, and the ablation results validate the robustness of the method. Overall, the proposed method significantly improves both the efficiency and the accuracy of micro-expression recognition.
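The two metrics reported above are both unweighted per-class averages, which keeps majority classes from dominating the score on imbalanced micro-expression data. A minimal NumPy sketch, assuming integer class labels:

```python
import numpy as np

def uf1_uar(y_true, y_pred, num_classes=3):
    # UF1: unweighted mean of per-class F1 scores.
    # UAR: unweighted mean of per-class recalls.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))
        recalls.append(tp / max(tp + fn, 1))
    return float(np.mean(f1s)), float(np.mean(recalls))
```

For the LOSO protocol itself, scikit-learn's LeaveOneGroupOut with the subject ID as the group variable can reproduce the split described above.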
Keywords