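Facial microexpression recognition using a spatiotemporal attention mechanism
Abstract
Objective Microexpressions are spontaneous facial muscle movements that reveal the genuine emotions a person is trying to hide, and they have potential applications in security, suspect interrogation, and psychological testing. To alleviate the low recognition accuracy caused by the small amplitude and short duration of the facial muscle movements in microexpressions, this paper proposes a spatiotemporal attention network (STANet) for microexpression recognition. Method STANet contains a spatial attention module and a temporal attention module. The spatial attention module first focuses the model's attention on the regions where the microexpression is more intense; the temporal attention module then assigns greater weights to the frames in which the microexpression changes more and which are therefore more discriminative. Result On three public microexpression datasets (the Chinese Academy of Sciences microexpression, CASME; CASME II; spontaneous microexpression database-high speed camera, SMIC-HS), STANet was compared with eight other algorithms under leave-one-out cross validation. The experimental results show that, on the CASME dataset, the classification accuracy of STANet is 1.78% higher than that of the second-best model, Sparse MDMO (sparse main directional mean optical flow); on the CASME II dataset, its classification accuracy is 1.90% higher than that of the second-best model, HIGO (histogram of image gradient orientation); and on the SMIC-HS dataset, its classification accuracy reaches 68.90%. Conclusion Considering that microexpressions involve small muscle movements, small facial regions, and short durations, this paper applies the attention mechanism to the microexpression recognition task and proposes the STANet model, which focuses the model's attention on the regions where the microexpression is stronger and on the clips with larger changes between adjacent frames.
Keywords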
Spatiotemporal attention network for microexpression recognition
Li Guohao1, Yuan Yifan1, Ben Xianye2, Zhang Junping1
(1. School of Computer Science, Fudan University, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200438, China; 2. School of Information Science and Engineering, Shandong University, Qingdao 266237, China)
Abstract
Objective Microexpression, a kind of spontaneous facial muscle movement, can reveal the real emotions that people try to conceal. Microexpression has potential applications in security, police interrogation, and psychological testing. Compared with macroexpressions, microexpressions have lower intensity and shorter duration, which makes them harder to recognize. Traditional methods can be divided into facial image- and optical flow-based approaches. Facial image-based methods partition each sample into spatiotemporal blocks to construct feature vectors, and the partition parameters are treated as hyperparameters shared by every sample of the dataset. Because different samples may require different spatiotemporal partitions, using the same blocks for all samples can hurt recognition performance. Optical flow-based methods are widely used for microexpression recognition. Although such methods are reasonably robust to illumination changes, they treat facial features in all regions as equally important and ignore the fact that microexpressions appear only in some regions of the face. The attention mechanism, which has been introduced in many fields, such as natural language processing and computer vision, can focus on salient regions of an object and give additional weights to these regions. Given the strong performance of attention mechanisms in recognition tasks, we apply the attention mechanism to the microexpression recognition task and propose a spatiotemporal attention network (STANet). Method STANet mainly consists of a spatial attention module (SAM) and a temporal attention module (TAM). SAM is used to focus on microexpression regions with high intensity, while TAM is incorporated to learn which frames are discriminative and to give them additional weights.
Inspired by the fully convolutional network (FCN), which was proposed for semantic segmentation, we propose a spatial attention branch (SAB) in the SAM. SAB, a top-down and bottom-up structure, is a crucial component of SAM. In the downsampling process, convolutional layers and nonlinear transformations extract salient features of the microexpression and are followed by max pooling, which reduces the resolution and enlarges the receptive field of the feature map. In the upsampling process, bilinear interpolation gradually restores the feature map to its original size, and skip connections retain detailed information that may otherwise be lost. A sigmoid function is applied after the last layer to normalize the SAB output to [0,1]. Furthermore, we propose a temporal attention branch (TAB) to focus on the more discriminative frames in the microexpression sequence, which are crucial in microexpression recognition. Experiments are conducted using the Chinese Academy of Sciences microexpression (CASME), the Chinese Academy of Sciences microexpression II (CASME II), and spontaneous microexpression database-high speed camera (SMIC-HS) datasets with 171, 246 and 164 samples, respectively. Corner crop and rescaling augmentations are used in CASME and CASME II to reduce overfitting, with scaling factors of 0.9, 1.0 and 1.1. Corner crop and horizontal flip augmentations are applied in the SMIC-HS dataset. Because samples have different numbers of frames, linear interpolation is used to resample every sample to 20 frames. Samples are then resized to 192×192 pixels. Finally, we use FlowNet 2.0 to obtain the optical flow sequence of each sample. The model is trained using the Adam optimizer with a learning rate of 1E-5. The weight decay coefficient is set to 1E-4, and the coefficient λ of the L1 regularization term is set to 1E-8.
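The temporal normalization step mentioned above (linear interpolation of every sample to 20 frames) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `resample_frames`, the representation of a frame as a flat list of floats, and the choice to align the first and last frames of the input and output sequences are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical stand-in: one frame flattened into a list of pixel values.
Frame = List[float]


def resample_frames(frames: Sequence[Frame], target_len: int = 20) -> List[Frame]:
    """Linearly interpolate a frame sequence along time to target_len frames.

    Assumption: the first and last output frames coincide with the first
    and last input frames (endpoint-aligned sampling).
    """
    if not frames:
        raise ValueError("need at least one input frame")
    if target_len < 1:
        raise ValueError("target_len must be positive")
    n = len(frames)
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(target_len)]
    out: List[Frame] = []
    for k in range(target_len):
        # Position of output frame k on the input time axis.
        pos = k * (n - 1) / (target_len - 1)
        lo = min(int(pos), n - 2)  # index of the left neighbouring frame
        w = pos - lo               # blend weight of the right neighbour
        a, b = frames[lo], frames[lo + 1]
        out.append([(1.0 - w) * x + w * y for x, y in zip(a, b)])
    return out
```

For example, `resample_frames(clip)` maps a clip of any length to exactly 20 frames; shorter clips are stretched and longer clips are compressed by the same rule.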
The number of iterations is 60, 30 and 100 for CASME, CASME II and SMIC-HS, respectively. Result We compare our model with eight state-of-the-art methods, including facial image- and optical flow-based methods, on three public microexpression datasets, namely, CASME, CASME II and SMIC-HS. Leave-one-subject-out (LOSO) cross validation is used because the datasets contain few samples. Classification accuracy is used to measure the performance of each method. The results show that our model achieves the best performance on the CASME and CASME II datasets. On the CASME dataset, the classification accuracy of our model is 1.78% higher than that of Sparse MDMO (sparse main directional mean optical flow), which ranks second. On the CASME II dataset, the classification accuracy of STANet is 1.90% higher than that of histogram of image gradient orientation (HIGO), which ranks second. On the SMIC-HS dataset, the classification accuracy of our model is 68.90%. Ablation studies are also performed using the CASME dataset. The results verify the effectiveness of the SAM and TAM and show that the fusion algorithm significantly improves recognition accuracy. Conclusion STANet is proposed in this study for microexpression recognition. SAM emphasizes salient regions of the microexpression by placing additional weights on these regions. Additionally, TAM learns larger weights for clips with high variation in the sequence. Experiments on the three public microexpression datasets show that STANet achieves the highest recognition accuracy on the CASME and CASME II datasets among the compared methods and performs satisfactorily on the SMIC-HS dataset.
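The leave-one-subject-out protocol used in the experiments can be sketched as follows. This is only an illustration under stated assumptions: the names `loso_splits` and `loso_accuracy` and the `fit_predict` callback are hypothetical, and pooling predictions over all folds before computing accuracy is an assumption, since the abstract does not say whether accuracy is pooled or averaged per subject.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def loso_splits(
    subject_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def loso_accuracy(
    subject_ids: Sequence[str],
    labels: Sequence[str],
    fit_predict: Callable[[List[int], List[int]], List[str]],
) -> float:
    """Pool predictions over all folds and return the overall accuracy.

    fit_predict is a hypothetical callback: it trains on the train indices
    and returns one predicted label per test index, in order.
    """
    correct = 0
    for _, train, test in loso_splits(subject_ids):
        preds = fit_predict(train, test)
        correct += sum(1 for i, p in zip(test, preds) if p == labels[i])
    return correct / len(labels)
```

Because all samples of the held-out subject are excluded from training, no subject contributes to both the training and the test side of a fold, which is the point of the protocol when each subject supplies several samples.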
Keywords
microexpression recognition; classification; facial feature; deep learning; attention mechanism; spatiotemporal attention