Adaptive optical flow estimation driven micro-expression recognition
Bao Yongtang¹, Wu Chenxi¹, Zhang Peng¹, Shan Caifeng² (1. Shandong University of Science and Technology; 2. College of Electrical Engineering and Automation, Shandong University of Science and Technology)
Abstract
Objective Micro-expression recognition aims to automatically analyze and identify a subject's emotional category from involuntary facial muscle movements, and has important application value in lie detection, psychological diagnosis, and other areas. However, current micro-expression recognition methods usually rely on offline optical flow estimation, which limits the representational power of micro-expression features. To address this problem, this paper proposes a micro-expression recognition model based on adaptive optical flow estimation (adaptive micro-expression recognition, AdaMER). Method AdaMER couples the two tasks of optical flow estimation and micro-expression classification in parallel to adaptively learn micro-expression-related motion features. First, a dense differential encoder-decoder is proposed to extract multi-level facial displacement information and realize adaptive optical flow estimation; then, a vision Transformer is used to mine the discriminative micro-expression information embedded in the reconstructed optical flow; finally, the semantic micro-expression information carried by facial displacement is fused with the discriminative information for micro-expression classification. Result Extensive experiments are conducted on a composite micro-expression dataset built from SMIC (spontaneous micro-expression recognition), SAMM (spontaneous micro-facial movement dataset), and CASME II (Chinese Academy of Sciences micro-expression). The proposed method achieves a UF1 (unweighted F1-score) of 82.89% and a UAR (unweighted average recall) of 85.95%, improvements of 1.77% and 4.85% over the latest method FRL-DGT (feature representation learning with adaptive displacement generation and transformer fusion). Conclusion The proposed method fuses the two tasks of adaptive optical flow estimation and micro-expression classification: on the one hand, it performs adaptive optical flow estimation in an end-to-end manner to perceive subtle facial movements and improve the description of subtle expressions; on the other hand, it fully exploits discriminative micro-expression information to improve recognition performance.
Keywords
Adaptive optical flow estimation driven micro-expression recognition
(Shandong University of Science and Technology)
Abstract
Objective Micro-expressions are brief, subtle facial muscle movements that involuntarily reveal emotions a person is trying to conceal. They reflect a person's true feelings and motivations more faithfully than macro-expressions. Micro-expression recognition aims to automatically analyze and identify a subject's emotional category from these involuntary facial muscle movements, and has important application value in lie detection, psychological diagnosis, and other areas. In the early development of micro-expression recognition, local binary patterns and optical flow were widely used as features for training traditional machine learning models. However, such hand-crafted features rely on manually designed rules, which makes it difficult to adapt to the differences in micro-expression data across individuals and scenarios. Since deep learning can automatically learn effective feature representations from images, the performance of deep learning-based micro-expression recognition far exceeds that of traditional methods. Nevertheless, micro-expressions manifest as subtle facial changes, so the recognition task remains challenging. By analyzing pixel motion between consecutive frames, optical flow can represent the dynamic information of micro-expressions, and deep learning-based methods describe facial muscle motion with optical flow information to improve recognition performance. However, existing micro-expression recognition methods usually extract optical flow offline, which depends on existing optical flow estimation techniques, describes subtle expressions insufficiently, and neglects static facial expression information, all of which restrict the recognition performance of the model.
Therefore, this paper proposes a micro-expression recognition network based on adaptive optical flow estimation, which couples the two tasks of optical flow estimation and micro-expression classification in parallel to adaptively learn micro-expression-related motion features. Method Training samples of micro-expressions are limited, which makes it difficult to train complex network models. Therefore, in the preprocessing stage, this paper selects the apex frame and its neighboring frames in each micro-expression video sequence as training data. In addition, when loading the data, the original training data are replaced, with a certain probability, by image pairs from the video sequence that contain motion information. Second, a deep network with a dense differential encoder-decoder implements adaptive optical flow estimation of facial muscle motion to improve the characterization of subtle expressions. In the dense differential encoder, ResNet18 extracts features from the two frames and from their difference map, and the branches processing the two frames share parameters. A motion enhancement module is added to the feature extraction branch of the difference image to accomplish inter-layer information interaction. In the motion enhancement module, a spatial attention mechanism focuses the difference-map features on micro-expression-related motion; the features of the two frames are subtracted from each other to preserve and amplify inter-frame differences; and both cues provide valid information for the subsequent network. The decoder maps the multi-level facial displacement information extracted by the dense differential encoder, together with the last-layer features of the two frames, to reconstruct the optical flow features.
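The motion enhancement module described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `motion_enhance`, the channel-mean spatial attention, and the additive combination are all assumptions made for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_enhance(feat_a, feat_b, feat_diff):
    """Hypothetical motion enhancement step.

    feat_a, feat_b : features of the two frames, shape (C, H, W)
    feat_diff      : features of the difference map, shape (C, H, W)
    """
    # Spatial attention from the difference-map features: collapse
    # channels, squash to (0, 1), and reweight every location.
    attn = sigmoid(feat_diff.mean(axis=0, keepdims=True))  # (1, H, W)
    attended = feat_diff * attn
    # Subtract the two frame features to preserve and amplify
    # the inter-frame change, then combine both cues.
    frame_delta = feat_a - feat_b
    return attended + frame_delta
```

In the actual network these operations would act on learned ResNet18 feature maps at each encoder level; the sketch only shows how the two cues (attended difference features and amplified frame differences) could be merged.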
The vision Transformer is a deep learning model based on the self-attention mechanism, which has global perception capability compared with traditional convolutional neural networks. With its feature extraction capability, the micro-expression discriminative information embedded in the reconstructed optical flow is mined. Finally, the semantic micro-expression information extracted from facial displacement and the discriminative information extracted by the vision Transformer are fused to provide rich cues for micro-expression classification. For the optical flow estimation task, an endpoint error (EPE) loss is imposed, which continuously reduces the Euclidean distance between the predicted and ground-truth optical flow. Cross-entropy losses constrain both the features extracted by the vision Transformer and the fused features, so that the network learns micro-expression-related information. Meanwhile, the frame with lower motion intensity of the two is treated as a neutral expression (without motion information), and a KL-divergence loss is applied to the encoder's output features to suppress irrelevant information. These loss functions interact to optimize the network. Result Model performance is evaluated on public datasets using the leave-one-subject-out (LOSO) cross-validation strategy. Face alignment and cropping are performed on the samples to unify the datasets. To demonstrate the state of the art of the proposed method, we compare it with existing mainstream methods on the composite dataset constructed from SMIC, SAMM, and CASME II. Our method achieves 82.89% UF1 and 85.59% UAR on the full dataset, 78.16% and 80.89% on the SMIC part, 94.52% and 96.02% on the CASME II part, and 73.24% and 75.83% on the SAMM part.
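Two of the training objectives named in the Method description, the endpoint error loss and the KL divergence, have standard definitions that can be sketched in NumPy. The function names and array shapes below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def epe_loss(flow_pred, flow_true):
    """Average endpoint error: the mean Euclidean distance between
    predicted and ground-truth flow vectors, arrays of shape (H, W, 2)."""
    return np.sqrt(((flow_pred - flow_true) ** 2).sum(axis=-1)).mean()

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions; eps avoids log(0).
    In the paper's setting, q would be the encoder response to the
    low-motion (near-neutral) frame being suppressed."""
    p = p + eps
    q = q + eps
    return float((p * np.log(p / q)).sum())
```

Minimizing the EPE term pulls the reconstructed flow toward the ground truth vector by vector, while the KL term penalizes encoder responses that differ from the neutral-expression reference.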
Our method achieves optimal results on the full dataset, the SMIC part, and the CASME II part, and sub-optimal results on the SAMM part. Ablation experiments are conducted to demonstrate the effectiveness of each module: the proposed encoder structure improves accuracy by 20.81% over the baseline; adding feature fusion improves accuracy by a further 1.5%; and adding the KL-divergence loss constraint improves accuracy by 4.2%. Conclusion The proposed micro-expression recognition model based on adaptive optical flow estimation fuses the two tasks of adaptive optical flow estimation and micro-expression classification: on the one hand, it senses subtle facial movements in an end-to-end manner and improves the description of subtle expressions; on the other hand, it fully exploits micro-expression discriminative information and enhances recognition performance.
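The two evaluation metrics reported throughout, UF1 and UAR, are the macro-averaged (class-unweighted) F1 score and recall. A minimal sketch of how they are computed (the function name `uf1_uar` is an assumption for illustration):

```python
import numpy as np

def uf1_uar(y_true, y_pred, num_classes):
    """Unweighted F1 (macro F1) and unweighted average recall (macro recall).

    Every class contributes equally, regardless of how many samples it
    has -- important for class-imbalanced micro-expression datasets.
    """
    f1s, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return float(np.mean(f1s)), float(np.mean(recalls))
```

Under LOSO evaluation, predictions from all held-out subjects are pooled before computing these metrics, so no subject dominates the score.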
Keywords