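FastFace: a real-time and robust face detection algorithm
Abstract
Objective Although face detectors based on deep neural networks have greatly improved detection accuracy, they do so at the cost of depending on powerful computing resources. Achieving high detection accuracy on a CPU while running at real-time speed remains a major challenge. To address fast and robust face detection under unconstrained conditions, we propose a detection method based on a lightweight neural network. Method Inspired by the lightweight network MobileNet, the proposed algorithm extracts features with depth-wise separable convolution and combines the ideas of Inception and residual connections to build several feature extraction modules, from which a simple and efficient feature extraction network is trained. For detection, a one-stage strategy is adopted: convolutions applied at several levels of the backbone network classify and localize target regions simultaneously. To refine the target regions, prior boxes are first preset on the corresponding feature layers, and a bounding box regression algorithm then adjusts their positions and sizes to bring them close to the ground-truth boxes. To reduce the number of prior boxes and save model parameters, the prior boxes are set according to the characteristics of face bounding boxes. Result The detection model is built and trained with the TensorFlow deep learning library, tested on the FDDB dataset, and compared with several classical algorithms in detection speed and accuracy. Compared with typical deep learning methods such as multitask cascaded convolutional networks (MTCNN), the proposed algorithm raises the detection speed on a CPU to 25 frames/s while keeping the mean average precision (mAP) at 0.892, which is higher than that of most traditional algorithms. The experimental results show that the method achieves real-time, high-accuracy detection on a CPU. Conclusion A face detection method based on a lightweight network model is proposed. Its backbone network is built from simple and efficient convolution modules, and reasonable prior boxes are set according to face aspect ratios during detection. Under unconstrained conditions and with limited computing resources, the method performs well in accuracy and also detects quickly, making it a robust detection method.
Keywords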
FastFace: a real-time robust algorithm for face detection
Li Qiyun1,2, Ji Qingge1,2, Hong Saiding1,2
(1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China; 2. Guangdong Province Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China)
Abstract
Objective Face detection is a crucial step in many problems involving verification, identification, and expression analysis. Although state-of-the-art convolutional neural network (CNN)-based face detectors exhibit improved detection accuracy, they are computationally prohibitive and therefore unsuitable for CPU devices. Achieving high detection accuracy on CPUs while running in real time remains challenging. One reason is that most backbone networks in current face detection models are transferred from generic object detection networks; these models are large and contain information that is redundant for modeling human faces. Moreover, the large search space of possible face locations and the variation of face sizes within one image demand heavy computation for robust detection. To address fast and robust face detection under unconstrained conditions, this paper proposes a detection method based on a self-designed lightweight neural network. Method The underlying idea is to compress and accelerate deep networks without significantly decreasing model performance. Efforts have been made to design compact networks, and previous results show that changing how the convolution is carried out can save parameters in neural networks. In this study, depth-wise separable convolution, popularized by MobileNets, is used for feature extraction. We then combine the ideas of Inception and residual connections to construct several feature extraction modules, which together constitute our backbone network. Unlike standard convolution, depth-wise separable convolution implements the convolution operation as a depth-wise convolution followed by a 1×1 point-wise convolution. When the kernel size is 3×3, depth-wise separable convolution requires 8 to 9 times less computation than standard convolution.
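The 8 to 9 times figure can be checked with a short calculation. The sketch below counts multiply-accumulate operations for a standard k×k convolution and for its depth-wise separable counterpart; the channel counts (128, 256, 512) are illustrative assumptions and are not taken from the FastFace architecture.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates of a k x k standard convolution producing an h x w map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Cost of a k x k depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_factor(k, c_in, c_out, h=1, w=1):
    """How many times cheaper the separable form is than the standard form."""
    return standard_conv_macs(k, c_in, c_out, h, w) / depthwise_separable_macs(
        k, c_in, c_out, h, w
    )


if __name__ == "__main__":
    # Hypothetical layer widths, chosen only to illustrate the trend.
    for channels in (128, 256, 512):
        print(channels, round(reduction_factor(3, channels, channels), 2))
```

The ratio simplifies to k²·c_out / (k² + c_out), so for a 3×3 kernel it approaches 9 as the number of output channels grows and falls below 8 only for narrow layers.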
Because Inception modules and residual connections have become essential in recent networks, we introduce both into our feature extraction modules to enrich the receptive fields, while depth-wise separable convolution performs the feature extraction itself. In contrast to existing convolutional modules, we design our own bottleneck modules (with different strides), Inception modules, and residual Inception modules based on depth-wise separable convolution. The modules are then concatenated to form the complete network model. Inception modules, which are composed of bottleneck modules in parallel, aim at rapidly reducing the size of the input image. As the name suggests, residual Inception modules are Inception modules with residual connections; they decrease the sizes of feature maps and enrich receptive fields. Detection is carried out on multiple feature layers to increase robustness to scale variations of faces in input images. A one-stage detection strategy is applied for fast face detection: we conduct detection at three different levels of feature maps in a single feed-forward pass, that is, we simultaneously classify and regress object areas on these feature maps by using convolutions. To fine-tune the exact locations of the object areas, we set prior boxes, namely default anchors, on the corresponding feature layers and then use bounding box regression to adjust the location and size of the anchors so that they move closer to the ground truth. To reduce the number of default anchors and save model parameters, we set the default anchors according to prior knowledge of the face box aspect ratio. Result We construct and train our detection model with the TensorFlow deep learning library. Our model is trained on the WIDER FACE dataset with several data augmentation techniques.
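The default-anchor and bounding box regression steps described in the Method can be sketched as follows. The abstract does not state the anchor scales, the feature map strides, or the offset encoding, so everything concrete here is an assumption: the anchors are square (1:1, as a stand-in for a face-shaped prior), the scale and stride values are hypothetical, and the offsets use the common center-size parameterization of SSD-style detectors rather than the authors' confirmed formulation.

```python
import math


def make_square_anchors(fmap_size, stride, scales):
    """Place one square (cx, cy, w, h) anchor per scale at the centre of every cell.

    Assumption: faces are roughly square, so a single 1:1 aspect ratio is used
    instead of the several ratios typical of generic object detectors.
    """
    anchors = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for s in scales:
                anchors.append((cx, cy, float(s), float(s)))
    return anchors


def encode(anchor, gt_box):
    """Regression targets that move an anchor onto a ground-truth box."""
    acx, acy, aw, ah = anchor
    gcx, gcy, gw, gh = gt_box
    return ((gcx - acx) / aw, (gcy - acy) / ah,
            math.log(gw / aw), math.log(gh / ah))


def decode(anchor, offsets):
    """Apply predicted offsets to an anchor to obtain the refined box."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (acx + tx * aw, acy + ty * ah,
            aw * math.exp(tw), ah * math.exp(th))


if __name__ == "__main__":
    # Hypothetical 4 x 4 feature map with stride 8 and two anchor scales.
    anchors = make_square_anchors(fmap_size=4, stride=8, scales=(16, 32))
    print(len(anchors))
    print(decode(anchors[0], encode(anchors[0], (10.0, 12.0, 20.0, 24.0))))
```

Using one aspect ratio per scale keeps the anchor count at fmap_size² × len(scales) per level, which is the kind of saving the text attributes to face-specific anchors.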
We test our model on the Face Detection Data Set and Benchmark (FDDB) and compare its mean average precision (mAP) and detection speed with those of several classical algorithms. The proposed method achieves real-time, high-precision detection on the CPU. Compared with typical deep learning methods, such as multitask cascaded convolutional networks (MTCNN), our method raises the detection speed on CPUs to 25 frames per second while maintaining an mAP of 0.892, which is higher than that of most traditional methods. Conclusion Face detectors based on deep learning exhibit improved detection accuracy, but their high computational complexity makes them very slow on CPUs. This paper presents a fast and robust face detection method based on a lightweight neural network. A simple and efficient convolutional neural network is constructed from depth-wise separable convolution, and the ideas of Inception and residual connections are used to keep the model both lightweight and powerful. The default anchors are set according to the characteristics of face boxes, and a one-stage detection strategy is applied. Experiments demonstrate that the proposed method significantly reduces redundant operations in the detection process. Running at 25 frames/s on CPUs, the method is robust: it performs well in accuracy and detects quickly with limited computing resources under unconstrained conditions.
Keywords
computer vision; face detection; convolutional neural network (CNN); compact model; one-stage detection; default anchor