Automated ASD screening based on a deep spatiotemporal facial expression recognition model for infants and toddlers

Tang Chuangao1,2, Zheng Wenming1,2, Zong Yuan1,2, Qiu Nana3, Yan Simeng1,2, Zhai Mengyao3, Ke Xiaoyan3 (1. School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China; 2. Key Laboratory of Child Development and Learning Science, Ministry of Education, Nanjing 210096, China; 3. Child Mental Health Research Center, Affiliated Brain Hospital of Nanjing Medical University, Nanjing 210029, China)

Abstract
Objective Screening for high-risk autism spectrum disorder (HR-ASD) relies on clinicians' assessments and questionnaire-based scales. Because this traditional approach is inefficient, an efficient automated screening tool is urgently needed. To meet this need, this paper proposes an automated ASD screening method based on the facial expression analysis of infants and toddlers. Method First, 30 infants and toddlers aged 8-18 months were enrolled, including 10 suspected ASD cases (HR-ASD) and 20 typically developing babies. The still-face paradigm was introduced to induce the babies' emotion regulation behaviors under social stress. A deep spatiotemporal feature learning network for video-based infant facial expression recognition was then proposed: the spatial feature learning model was pretrained on the large-scale public dataset AffectNet, and the spatiotemporal feature learning model was subsequently trained on the self-built infant facial expression video dataset RCLS&NBH+ (Research Center of Learning Science & Nanjing Brain Hospital dataset+), yielding a fairly accurate infant facial expression recognition model. Based on the first-order statistics of the model's deep feature sequences, the association between the babies' facial expression symptoms under social stress and their mental health status was established, and machine learning methods were applied for automated screening. Result 1) According to the manual annotation of the babies' facial expressions, the duration of the neutral expression in the high-risk group was longer than that in the typically developing control group during the 1 min still-face episode (p < 0.01), whereas no statistically significant differences were found for the other expressions. 2) The proposed deep spatiotemporal feature learning network achieved an overall average recognition rate of 87.1% on the facial expression video dataset of the 30 babies in this study, and the predictions for the three expression classes were highly consistent with the manual annotations, with a Kappa coefficient of 0.63 and a Pearson correlation coefficient of 0.67. 3) Mental health status prediction based on the first-order statistics of the deep facial expression feature sequences achieved a sensitivity of 70%, a specificity of 90%, and a classification accuracy of 83.3% (permutation test, p < 0.05). Conclusion The proposed automated mental health status prediction model based on the first-order statistics of infants' deep facial expression feature sequences is effective and can facilitate the automated screening of high-risk autism spectrum disorder.
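The duration comparison in Result 1) can be illustrated with a minimal sketch (not from the paper): per-frame human expression codes for the 1 min still-face episode are converted to per-class durations in seconds and compared between groups. The 25 fps frame rate, the label coding, and the use of a Mann-Whitney U test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

FPS = 25  # assumed video frame rate

def neutral_duration(frame_codes, neutral=1):
    # frame_codes: hypothetical per-frame expression codes for one baby's
    # still-face minute; duration = number of neutral frames / frame rate
    return np.sum(np.asarray(frame_codes) == neutral) / FPS  # seconds

def compare_groups(hr_codes, td_codes):
    hr = [neutral_duration(c) for c in hr_codes]  # HR-ASD group
    td = [neutral_duration(c) for c in td_codes]  # TD comparison group
    # assumed nonparametric group test; the paper does not name its test
    stat, p = mannwhitneyu(hr, td, alternative="two-sided")
    return np.mean(hr), np.mean(td), p            # e.g. 55.03 s vs 46.26 s, p < 0.01
```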
Keywords
Automated early identification of ASD via deep spatiotemporal model-based facial expression analysis for infants and toddlers

Tang Chuangao1,2, Zheng Wenming1,2, Zong Yuan1,2, Qiu Nana3, Yan Simeng1,2, Zhai Mengyao3, Ke Xiaoyan3(1.School of Biological Science&Medical Engineering, Southeast University, Nanjing 210096, China;2.Key Laboratory of Child Development and Learning Science, Ministry of Education, Southeast University, Nanjing 210096, China;3.Child Mental Health Research Center, Affiliated Brain Hospital of Nanjing Medical University, Nanjing 210029, China)

Abstract
Objective The traditional screening of high-risk autism spectrum disorder (HR-ASD) relies mainly on evaluation by pediatric clinicians. Owing to the low efficiency of this approach, highly efficient automated screening tools have become a major research topic. Although most research on facial expression markers has made some progress, the findings were derived from comparatively older children, and the effectiveness of the paradigms used in those studies has not been tested in babies aged 8-18 months, whose intelligence quotient (IQ), language, and social abilities are still developing. Moreover, the lack of a final diagnostic model in these studies limits their feasibility for large-scale ASD screening. In this study, a novel automated screening method was proposed that provides diagnostic results based on the analysis of babies' facial expression symptoms under social stress. Method Differences between the facial expressions of babies in the HR-ASD and typically developing (TD) comparison groups were examined. A total of 30 infants and toddlers were enrolled, of which 10 were at high risk of ASD and 20 were TD babies. All babies were 8-18 months old at enrollment, and all participants were re-diagnosed at around 25 months of age to confirm that each case in the two groups was truly ASD or TD. The still-face paradigm, comprising an amusing mother-baby interaction episode (baseline, 2 min) and a still-face episode (1 min), was employed to induce the babies' emotion regulation behaviors under the social stress of the latter episode. We hypothesized that facial features derived from an accurate facial expression recognition system can distinguish the two groups, and this hypothesis was then verified. To establish an accurate facial expression recognition system, a deep spatiotemporal feature learning network was proposed. The spatial feature learning module was pretrained on the open-access AffectNet dataset and further trained on a video-based baby facial expression dataset, the Research Center of Learning Science & Nanjing Brain Hospital dataset+ (RCLS&NBH+; 53 subjects, 101 videos, and 95 207 facial images), with a bidirectional long short-term memory network (Bi-LSTM) used for temporal modeling. The trained deep spatiotemporal network was verified on the facial expression dataset collected from the 30 enrolled babies, that is, the infant emotion dataset (IED). Three types of learned deep features were compared: feature_a (the output of the last fully connected layer, with 1 024 units, of a convolutional neural network (CNN) trained only on AffectNet), feature_b (feature_a's counterpart in the CNN part of the CNN+LSTM (long short-term memory) model), and feature_c (the 1 024-unit output of the Bi-LSTM module). Pearson's correlation was computed between these frame-level learned features and the corresponding frame-level facial expression labels, and feature subsets were selected using different correlation thresholds: no threshold (all 1 024 dimensions), 0 < r < 1, 0.2 < r < 1, 0.4 < r < 1, and 0.6 < r < 1. A first-order statistic, namely the mean of the selected frame-level features within a video, was then used to explore the association between the babies' mental health status and their facial expression symptoms under social stress.
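The spatiotemporal network can be sketched as follows in PyTorch: a CNN backbone producing 1 024-d per-frame spatial features, followed by a Bi-LSTM over the frame sequence. The backbone choice (ResNet-18) and all layer sizes other than the 1 024-unit feature layer named in the abstract are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNBiLSTM(nn.Module):
    def __init__(self, feat_dim=1024, lstm_hidden=512, num_classes=3):
        super().__init__()
        # stand-in backbone; the paper's CNN is pretrained on AffectNet
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)  # 1 024-d spatial feature (feature_a/b)
        self.cnn = backbone
        # Bi-LSTM over frame features; 2 * 512 = 1 024-d output matches feature_c
        self.bilstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)  # positive / neutral / negative

    def forward(self, clips):
        # clips: (batch, time, 3, H, W) face crops from a video
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)                 # (b*t, 3, H, W)
        spatial = self.cnn(frames).view(b, t, -1)    # per-frame 1 024-d features
        temporal, _ = self.bilstm(spatial)           # (b, t, 1 024) = feature_c sequence
        return self.classifier(temporal), temporal   # frame-level logits and deep features
```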
Such features were fed to linear classifiers for the automated screening of HR-ASD. Result 1) Based on the human coding of the babies' facial expressions under the still-face paradigm, babies at high risk of ASD showed a longer duration of neutral facial expression (55.03±7.34 s) than those in the TD comparison group (46.26±11.02 s) during the one-minute still-face episode (p<0.01). The other two types of facial expressions (positive and negative) showed no statistically significant differences. 2) The proposed deep spatiotemporal network achieved an overall average recognition accuracy of 87.1% on the self-collected infant and toddler facial expression dataset (IED) of the 30 babies in this study. The recall rates for positive, neutral, and negative facial expressions were 68.82%, 93.79%, and 59.57%, respectively, whereas those of the spatial model trained only on AffectNet were 36.32%, 77.06%, and 58.42%, respectively. The automated emotion predictions of the CNN+LSTM model were highly consistent with the human coding results, with a Kappa coefficient of 0.63 and a Pearson coefficient of 0.67. 3) Under leave-one-subject-out cross-validation, the proposed automated screening model, which applies a linear discriminant classifier to the proposed features (feature_c from the CNN+LSTM model) selected with the correlation threshold 0.6 < r < 1, achieved a sensitivity of 70%, a specificity of 90%, and an overall diagnostic accuracy of 83.3% (permutation test, p<0.05). This result verified the proposed hypothesis, and the comparison between feature_a and feature_c showed that a more accurate facial expression recognition model yields better diagnostic performance in distinguishing HR-ASD from TD. Conclusion The automated screening method based on the proposed features derived from babies' facial expressions is effective and shows potential for large-scale application.
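The screening stage can be sketched as follows: frame-level deep features are filtered by their Pearson correlation with frame-level expression labels, averaged over each video (the first-order statistic), and classified with a linear discriminant under leave-one-subject-out cross-validation with a permutation test. The 0.6 < r < 1 threshold and the classifier follow the abstract; the array names, shapes, and the number of permutations are assumptions for illustration, and the selection step is shown outside the cross-validation loop for brevity.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, permutation_test_score

# Hypothetical inputs:
# frame_feats: (n_frames, 1024) feature_c vectors pooled over all videos
# frame_labels: (n_frames,) expression codes, used only for feature selection
# video_feats: list of (t_i, 1024) arrays, one per video
# y: (n_videos,) HR-ASD/TD labels; subjects: (n_videos,) subject ids

def select_dims(frame_feats, frame_labels, r_low=0.6):
    # keep dimensions whose correlation with the expression labels is in (0.6, 1)
    r = np.array([np.corrcoef(frame_feats[:, j], frame_labels)[0, 1]
                  for j in range(frame_feats.shape[1])])
    return np.where((r > r_low) & (r < 1.0))[0]

def video_vectors(video_feats, dims):
    # first-order statistic: mean of the selected frame features per video
    return np.vstack([v[:, dims].mean(axis=0) for v in video_feats])

def screen(video_feats, y, subjects, dims):
    X = video_vectors(video_feats, dims)
    cv = LeaveOneGroupOut()  # leave-one-subject-out via subject groups
    acc, perm_scores, p_value = permutation_test_score(
        LinearDiscriminantAnalysis(), X, y, groups=subjects,
        cv=cv, n_permutations=1000, scoring="accuracy")
    return acc, p_value      # e.g. 83.3% accuracy with p < 0.05
```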
Keywords
