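MED: a multimodal emotion dataset in real environments
Abstract
Objective Research on emotion recognition has long aimed to help systems respond to user needs more appropriately during human-computer interaction, yet its performance in real applications remains poor. The main reason is the lack of large-scale multimodal datasets that resemble real application environments. Existing in-the-wild multimodal emotion datasets are few, include a limited number of subjects, and cover a single language. Method To meet the data requirements of deep learning algorithms, this paper collects, annotates, and prepares for public release a new video dataset captured in natural conditions, the multimodal emotion dataset (MED). Collectors first manually cut video clips from movies, TV series, and variety shows; annotators then labeled the clips, yielding 1 839 video clips. Valid video frames are obtained from these clips through person detection, face detection, and related operations. The dataset covers seven basic emotions and three modalities: facial expression, body posture, and emotional speech. Result To provide benchmarks for emotion recognition, the experimental part of this paper evaluates MED with machine learning and deep learning methods. A comparison with the CK+ dataset first shows that algorithms developed on data collected in laboratory environments are difficult to apply in practice. Baseline experiments are then run for each modality, and the baselines are reported. Finally, multimodal fusion improves on single-modality facial expression recognition by 4.03%. Conclusion MED extends the existing real-world multimodal databases, supporting research on cross-cultural (cross-language) emotion recognition and perceptual analysis of different emotion evaluations, and improving the performance of automatic affective computing systems in real applications.
Keywords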
MED: multimodal emotion dataset in the wild
Chen Jing, Wang Kejun, Zhao Cong, Yin Chaoqun, Huang Ziqiang (College of Automation, Harbin Engineering University, Harbin 150001, China)
Abstract
Objective Emotion recognition, or affective computing, is crucial in many human-computer interactions, including interaction with artificial intelligence (AI) assistants such as home robots, Google Assistant, and Alexa, and even with self-driving cars. AI assistants and other technologies can also identify a person's emotional or cognitive state to help people live happy, healthy, and productive lives and even to support mental health treatment. Adding emotion recognition to human-machine systems helps the computer recognize the emotion and intention of users when they speak and give an appropriate response. To date, computers capture and interpret user emotions and intentions inaccurately, mainly because the datasets used to develop intelligent systems differ from the actual application environment and because little data is collected in that environment, which reduces system robustness. Simple datasets collected in the laboratory, which rely on unreasonable methods of inducing emotion, are typically characterized by a plain background and strong, uniform illumination. The resulting emotions are exaggerated and unnatural. In actual use, the age, gender, and ethnicity of users, the complexity of the application environment, and the diversity of capture angles are all problems that must be addressed when developing a system. Therefore, systems developed in the laboratory environment are difficult to apply in the real world. Method Creating a dataset from the real environment can resolve the inconsistency between the datasets used in software development and real-world application. In-the-wild datasets, especially multimodal emotion datasets containing dynamic information, are limited. Therefore, this paper collected and annotated a new multimodal emotion dataset (MED) in the real environment.
First, five collectors watched videos from various data sources with different content, such as TV series, movies, talk shows, and live broadcasts, and extracted over 2 500 video clips containing emotional states. Second, the frames of each video are extracted and saved in a folder to form the video sequence. A pedestrian detection model is used to obtain valid video frames because only some frames contain valid person or face information. Frames in which no person is detected are treated as invalid. The resulting frames, which contain only the person, can be used to investigate emotional information conveyed by posture, such as that of the limbs. Posture can be used to assess the emotional state of a person when the face is blocked or the person moves over a large range. Facial expressions account for a large proportion of emotional judgment. Third, two methods are used for face detection. Finally, annotators manually annotated the video sequences of detected people and faces, even though the collectors had already selected clips according to emotional state during manual cutting. Because humans deviate in emotional judgment and differ in their sensitivity to emotion, the paper used crowdsourcing for annotation. Crowdsourcing has been used to collect many datasets, such as ImageNet and RAF. Fifteen taggers with professional training in emotional information independently tagged all the video clips. After annotation, a total of 1 839 video clips covering seven types of emotions were obtained. Result Different divisions of the dataset are presented in the study. Following the division used by acted facial expressions in the wild (AFEW), the dataset is split into training and validation sets at a ratio of 0.65:0.35. The amount of data for each type of emotion in the AFEW and MED datasets is then compared in a graph. MED has more samples for each type of emotion than AFEW.
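The annotation and division steps described above can be illustrated with a minimal sketch. It assumes that the independent tags for a clip are reduced to one label by majority vote and that the 0.65:0.35 division is stratified by emotion; neither detail is stated in the abstract, and the function names `majority_label` and `stratified_split` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def majority_label(votes):
    """Return the most frequent tag, or None if there are no votes or a tie.

    Assumption: majority voting; the abstract does not say how the
    independent tags are combined.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: leave the clip unlabeled
    return ranked[0][0]


def stratified_split(labels, train_ratio=0.65, seed=0):
    """Split clip ids into training and validation lists, emotion by emotion.

    labels maps clip id -> emotion label. Assumption: the split is
    stratified so that each emotion keeps roughly the same ratio.
    """
    by_label = defaultdict(list)
    for clip_id, label in sorted(labels.items()):
        by_label[label].append(clip_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_ids, val_ids = [], []
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        cut = round(len(ids) * train_ratio)
        train_ids.extend(ids[:cut])
        val_ids.extend(ids[cut:])
    return train_ids, val_ids
```

Clips for which `majority_label` returns `None` would need a separate rule, such as re-annotation or exclusion; which rule applies to MED is not specified here.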
The paper evaluates the dataset using a large number of deep learning and machine learning algorithms and provides baselines for each modality. First, classic machine learning methods, such as local binary patterns (LBP), histogram of oriented gradients (HOG), and Gabor wavelets, are applied to obtain the baseline of the CK+ dataset. When the same methods are applied to the MED dataset, accuracy decreases by more than 50%. Data collected in the real environment are complicated, and an algorithm developed on a laboratory dataset is unsuitable for the real environment. Hence, creating a dataset in the real environment is necessary. The comparison of the AFEW and MED datasets verifies that the MED data are reasonable and effective. Baselines for facial expression recognition and for the two other modalities are also provided. The results indicate that the other modalities can serve as an auxiliary means of comprehensively assessing emotions, especially when the face is blocked or face information is unavailable. Finally, the accuracy of emotion recognition improves by 4.03% through multimodal fusion. Conclusion MED is a multimodal real-world dataset that expands the existing multimodal datasets. Researchers can develop deep learning algorithms by combining MED with other datasets to form a large multimodal database that contains multiple languages and ethnicities, promote cross-cultural emotion recognition and perception analysis of different emotion evaluations, and improve the performance of automatic emotion computing systems in real applications.
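One way to picture how multimodal fusion can help when the face is blocked is a score-level (late) fusion sketch. This is an illustrative assumption, not the fusion method of the paper, which the abstract does not specify: each modality is assumed to output a probability per emotion class, and the names `fuse_scores` and `predict` and the equal default weights are hypothetical.

```python
def fuse_scores(modality_probs, weights=None):
    """Weighted average of per-class probabilities over available modalities.

    modality_probs maps a modality name (e.g. "face", "posture", "speech")
    to a list of class probabilities, or to None when that modality is
    unavailable, for example when the face is blocked.
    """
    available = {m: p for m, p in modality_probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality must be available")
    if weights is None:
        weights = {m: 1.0 for m in available}  # assumption: equal weights

    total = sum(weights[m] for m in available)
    if total <= 0:
        raise ValueError("weights of available modalities must sum to > 0")

    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for modality, probs in available.items():
        if len(probs) != n_classes:
            raise ValueError("all modalities must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += weights[modality] * p / total
    return fused


def predict(modality_probs, weights=None):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(modality_probs, weights)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, calling `predict({"face": None, "posture": [0.2, 0.8], "speech": [0.3, 0.7]})` falls back to posture and speech alone, which mirrors the auxiliary role of the other modalities described above.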
Keywords