Noncontact pulse signal extraction based on a multiview neural network

Zhao Changchen, Ju Feng, Feng Yuanjing (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023)

Abstract
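Objective Remote photoplethysmography (rPPG) is a video-based, noncontact heart rate measurement technique that has attracted wide research attention. Extracting pulse signals from video data requires considering temporal and spatial information simultaneously. However, existing methods often separate spatial processing from temporal processing, which leads to inaccurate modeling and low measurement accuracy. This paper proposes a neural network model based on multiview 2D convolution that models intraframe and interframe correlations to improve measurement accuracy. Method The proposed network consists of normal 2D convolutional blocks and multiview convolutional blocks. The normal 2D convolutional blocks perform a preliminary abstraction of the input data in the spatial dimensions. The multiview convolutional block contains three pathways, which apply 2D convolution to the input data from three views (height-width, height-time, and width-time). The complementary spatiotemporal features of the three views are then fused to obtain the final pulse signal. The proposed multiview 2D convolution extends the conventional single-view 2D convolutional network to the temporal dimension. The method does not destroy the original structure of the video and mines complementary spatiotemporal features through convolution in three views, thereby improving pulse measurement accuracy. Result Experiments on the public dataset PURE (pulse rate detection dataset) and the self-built dataset Self-rPPG (self-built rPPG dataset) show that, compared with traditional methods, the signal-to-noise ratio (SNR) of the pulse signal extracted by the proposed network is 3.92 dB and 1.92 dB higher on the two datasets, respectively, and the mean absolute error (MAE) is 3.81 bpm and 2.91 bpm lower, respectively. Compared with the single-view network, the SNR is 2.93 dB and 3.20 dB higher, and the MAE is 2.20 bpm and 3.61 bpm lower, respectively. Conclusion The proposed network can estimate the subject's pulse signal with high accuracy in complex environments, which demonstrates the effectiveness of multiview 2D convolution for rPPG pulse extraction. Compared with rPPG algorithms based on a single-view 2D neural network, the proposed method extracts pulse signals with less noise and fewer low-frequency components, and it generalizes better.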
Keywords
Noncontact pulse signal extraction based on multiview neural network

Zhao Changchen, Ju Feng, Feng Yuanjing (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China)

Abstract
Objective Remote photoplethysmography (rPPG) has recently attracted considerable research attention because it can measure the blood volume pulse from video recordings using computer vision techniques, without any physical contact with the subject. Extracting pulse signals from video data requires considering spatial and temporal information simultaneously. However, existing methods commonly process the two kinds of information separately, which can result in inaccurate modeling and low measurement accuracy. A multiview 2D convolutional neural network for pulse extraction from video is proposed to model the intra- and interframe correlations of video data from three points of view. This study aims to investigate an effective spatiotemporal modeling method for rPPG and to improve pulse measurement accuracy. Method The proposed network, called the multiview heart rate network (MVHRNet), contains three pathways. It performs 2D convolution on a given video segment from three views of the input data, namely height-width (H-W), height-time (H-T), and width-time (W-T), and then integrates the complementary spatiotemporal features of the three views to obtain the final pulse signal. MVHRNet consists of two normal (H-W) convolutional blocks and three multiview (H-W, H-T, and W-T) 2D convolutional blocks. Each convolutional block (except the last block) includes dropout, convolutional, pooling, and batch normalization layers. The input and output of the network are a video clip and a predicted pulse signal, respectively. Multiview 2D convolution is a natural generalization of single-view 2D convolution to all three viewpoints of volumetric data. Take the normal 2D convolution in the H-W view as an example: the H-W filters scan one image from left to right and top to bottom, move to the next frame (slice), and repeat the process. 
Filters can learn the spatial correlation within each slice in the H-W view in this manner. Similarly, the same process is performed on each slice in the H-T view so that the filters can learn the correlation within the H-T view, which carries part of the temporal information of the video clip. The convolution in the W-T view likewise learns the temporal information within the W-T slices. Compared with existing rPPG methods, the proposed method models spatiotemporal information simultaneously, preserves the original structure of the video, and exploits the complementary spatiotemporal features by performing a three-view 2D convolution. Result Extensive experiments are conducted on two datasets, the public pulse rate detection dataset (PURE) and a self-built rPPG dataset (Self-rPPG), including an ablation study, comparison experiments, and cross-dataset testing. The experimental results show that the signal-to-noise ratio (SNR) of the signal extracted by the proposed network is 3.92 dB and 1.92 dB higher than that of the signal extracted using traditional methods on the two datasets, respectively, and 2.93 dB and 3.20 dB higher than that of the single-view network. We also evaluate the impact of the window length of the input video clip on the quality of the extracted signal. The SNR of the extracted signal increases and the mean absolute error (MAE) decreases as the window length increases, and both tend to saturate when the window length T exceeds 120. The results also show that the training times of the multiview and single-view networks are of the same order of magnitude. Conclusion Spatiotemporal correlation in videos can be effectively modeled using multiview 2D convolution. Compared with traditional rPPG methods (plane-orthogonal-to-skin (POS) and chrominance-based (CHROM) methods), the SNR of the pulse signals extracted by the proposed method on the two datasets increases by 52.9% and 42.3%, respectively. 
Compared with the rPPG algorithm based on the single-view 2D convolutional neural network (CNN), the proposed network extracts pulse signals with less noise and fewer low-frequency components, generalizes better, and has nearly equal computational cost. This study demonstrates the effectiveness of multiview 2D CNNs in rPPG pulse extraction. Hence, the proposed network outperforms existing methods in extracting pulse signals of subjects in complex environments.
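
The three-view slicing described in the abstract can be illustrated with a small, dependency-free Python sketch. This is not MVHRNet: it has no learned weights, dropout, pooling, batch normalization, or feature fusion, and the names `conv2d_valid`, `view_slices`, and `multiview_conv` are hypothetical helpers introduced only for illustration. It assumes a single-channel clip stored as nested lists indexed `clip[t][h][w]` and applies the same 2D kernel to every slice of each view.

```python
def conv2d_valid(plane, kernel):
    """Valid-mode 2D cross-correlation of one 2D slice with one 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(plane) - kh + 1
    out_w = len(plane[0]) - kw + 1
    return [
        [
            sum(plane[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def view_slices(clip, view):
    """Cut the volume clip[t][h][w] into 2D slices for one of the three views."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    if view == "HW":  # one slice per frame t, slice axes are (h, w)
        return [[[clip[t][h][w] for w in range(W)] for h in range(H)]
                for t in range(T)]
    if view == "HT":  # one slice per column w, slice axes are (h, t)
        return [[[clip[t][h][w] for t in range(T)] for h in range(H)]
                for w in range(W)]
    if view == "WT":  # one slice per row h, slice axes are (w, t)
        return [[[clip[t][h][w] for t in range(T)] for w in range(W)]
                for h in range(H)]
    raise ValueError(f"unknown view: {view}")


def multiview_conv(clip, kernel):
    """Apply the same 2D kernel to every slice of the H-W, H-T, and W-T views."""
    return {
        view: [conv2d_valid(s, kernel) for s in view_slices(clip, view)]
        for view in ("HW", "HT", "WT")
    }


# Toy clip (T=4, H=3, W=5) whose intensity changes only over time.
clip = [[[t for _ in range(5)] for _ in range(3)] for t in range(4)]
# A 1x2 difference kernel acting along the second axis of each slice.
features = multiview_conv(clip, [[-1, 1]])
```

On this toy clip the H-W responses are all zero, because each frame is spatially constant, while the H-T and W-T responses are all one, because the second axis of those slices is time. This is the sense in which the H-T and W-T views expose temporal information that a filter confined to the H-W view does not see.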
Keywords
