SPNet: a fast pyramid network for efficient detection of complex scenes

Li Xinze, Zhang Xuanxiong, Chen Sheng (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract
Objective Mainstream one-stage detection frameworks cannot balance detection accuracy and detection efficiency when handling complex target scenes in high-frame-rate video. To address this problem, we improve the one-stage network architecture and use optical flow to track feature maps, proposing a fast pyramid network for efficient detection of complex scenes, the snap pyramid network (SPNet). Method We inspect the interiors of the feature-extraction network and the pyramid network, fully visualizing the feature matrices and convolutional structures to identify the key factors that allow a one-stage model to better detect small and dense targets. We then construct the complex-scene detection network SPNet, in which a backbone network (MainNet) embeds a sub-network tracker (TrackNet). In MainNet, we design a feature weight control (FWC) module, improve the basic block, and design a multi-scale pyramid structure that couples MainNet's core network (BackBone) with a feature pyramid network (FPN), effectively improving the detection accuracy and robustness for the small, dense targets present in video key frames. In TrackNet, an optical-flow tracker is embedded into BackBone: high-precision optical-flow vectors map the feature maps produced by BackBone's convolutions, replacing the traditional fully convolutional feature architecture and effectively improving detection efficiency. Result SPNet balances the detection of small and dense targets, achieving average precision of 52.8% on the MS COCO (Microsoft common objects in context) detection dataset and 75.96% on PASCAL VOC, with 13.9% average precision on small objects in MS COCO; on the VOT (visual object tracking) dataset it reaches 42.1% average precision, and the detection speed rises to 50 to 70 frames/s. Conclusion The proposed fast-pyramid detection framework restructures the one-stage detection network and uses optical flow to fully reuse convolutional feature information. It focuses on detection capability in complex scenes and detection efficiency on video streams, achieving a good balance between accuracy and speed.
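The abstract does not specify the internal design of the FWC module; below is a minimal sketch, assuming it behaves like squeeze-and-excitation-style channel reweighting (raising the weight of target-relevant channels and suppressing disturbances). All function and weight names here are illustrative, not the authors' implementation.

```python
import numpy as np

def fwc(feat, w1, w2):
    """Hypothetical feature-weight-control (FWC) sketch: derive per-channel
    gates from global context, then reweight the feature map.

    feat: (C, H, W) feature map; w1: (C, C // r); w2: (C // r, C),
    where r is a reduction ratio.
    """
    s = feat.mean(axis=(1, 2))                # squeeze spatial dims -> (C,)
    h = np.maximum(s @ w1, 0.0)               # hidden excitation (ReLU)
    gates = 1.0 / (1.0 + np.exp(-(h @ w2)))   # per-channel weights in (0, 1)
    return feat * gates[:, None, None]        # boost/suppress each channel

# Toy usage: 16 channels, reduction ratio 4.
rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((16, 4)) * 0.1
w2 = rng.standard_normal((4, 16)) * 0.1
out = fwc(feat, w1, w2)
```

Because the gates stay strictly inside (0, 1), the module can only attenuate channels relative to one another; the learned weights decide which channels are suppressed.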
Snap pyramid network: real-time complex scene detecting system

Li Xinze, Zhang Xuanxiong, Chen Sheng (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract
Objective With the great breakthroughs of deep convolutional neural networks (CNNs), numerous state-of-the-art networks have been created that significantly improve image classification and power modern CNN-based object detectors. Image classification and detection research has matured and entered the industrial stage. However, detecting objects in complex samples and high-frame-rate videos remains a challenging task in computer vision, especially when each frame is filled with huge numbers of small and dense instances. Existing state-of-the-art networks do not treat the tradeoff between accuracy and efficiency for detecting small, dense targets as a priority. Thus, in this study, we propose a deep hybrid network, namely, the snap pyramid network (SPNet). Method Our model incorporates a dense optical flow technique into an enhanced one-stage architecture. First, a complete inner visualization of the feature-extraction network and the pyramid network is built to mine out the critical factors for small, dense object detection. Through this method, contextual information is found to be a significant key and should thus be fully utilized in feature extraction. Moreover, sharing context across the network's multiple convolutional templates is essential for propagating the high semantic information of the deep templates, which helps the shallow templates precisely predict target locations. Accordingly, our proposed hybrid network, SPNet, is presented. It is composed of two parts, namely, MainNet and TrackNet. For MainNet, inception and feature weight control (FWC) modules are designed to modify the conventional network architecture. The whole MainNet consists of the BackBone network (the core of MainNet, which efficiently extracts feature information) and the feature pyramid network (FPN), which predicts classification and candidate box positions. The inception module greatly reduces the parameter count.
FWC raises the weight of essential features that help detect prospective targets while suppressing the features of non-targets and other disturbances. To further accelerate training, swish activation and group normalization are also employed in SPNet. The training speed and validation accuracy of MainNet are better than those of YOLO (you only look once) and SSD (single shot multibox detector). As a result, the performance and robustness of small, dense object detection are substantially improved in the key frames of a video. For TrackNet, a dense optical flow field technique is applied to the adjacent frame to obtain the FlowMap. Next, to substantially improve detection efficiency, the FlowMap is mapped onto the FeatureMap through a pyramid structure instead of a traditional fully convolutional network. This method markedly shortens computation time because optical flow calculation on a GPU (graphics processing unit) is much faster than the feature extraction of a convolutional network. Then, for the adjacent frame, only the FPN, a lightweight network, needs to be computed. Result The model is trained and validated thoroughly on MS COCO (Microsoft common objects in context). Results demonstrate that the proposed SPNet makes a tradeoff between accuracy and efficiency for detecting small, dense objects in a video stream, obtaining 52.8% accuracy on MS COCO and 75.96% on PASCAL VOC, and speeding up to 70 frames/s. Conclusion Experimental results show that SPNet can effectively detect small, difficult targets in complex scenes. Reusing convolutional features via optical flow also greatly improves detection efficiency while maintaining good detection results. The investigation shows that studying a network's internal structure and inner processes is crucial to improving its performance; here, performance is remarkably improved by the complete visualization of the network structure and the overall optimization of the network architecture.
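The exact FlowMap-to-FeatureMap mapping is not detailed in the abstract; a minimal sketch, assuming a backward bilinear warp in the style of flow-guided feature propagation, is shown below. A key frame's BackBone features are carried to an adjacent frame by sampling them at flow-displaced positions, so only the lightweight FPN head must run on that frame. Function and variable names are illustrative.

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a feature map with a dense optical-flow field.

    feat: (C, H, W) feature map computed on the key frame.
    flow: (2, H, W) flow vectors (dx, dy) pointing from each position in
          the current frame back to its source in the key frame.
    Returns a (C, H, W) approximate feature map for the current frame.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sampling locations in the key frame, clamped to the image border.
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    # Bilinear interpolation among the four neighboring grid points.
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = sx - x0, sy - y0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0]
            + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0]
            + wy * wx * feat[:, y1, x1])

# Toy usage: with zero flow, warping is the identity.
f = np.arange(4 * 6 * 6, dtype=float).reshape(4, 6, 6)
same = warp_features(f, np.zeros((2, 6, 6)))
```

The efficiency argument in the abstract rests on this substitution: a dense flow field plus a bilinear warp is far cheaper on a GPU than re-running the full BackBone convolutions on every frame.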
