Fully convolutional semantic segmentation and object detection network
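Xiao Feng1, Rui Ting2, Ren Tongwei3, Wang Dong2
(1. School of Graduate, PLA Army Engineering University, Nanjing 210018; 2. School of Field Works, PLA Army Engineering University, Nanjing 210018; 3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210018)
Abstract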
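Objective Current mainstream object detection algorithms must define default boxes in advance and obtain object boxes by screening and eliminating default boxes. To ensure sufficient recall, the default boxes must be dense enough and cover multiple scales, so each region of the image is detected repeatedly, which wastes a great deal of computation. We propose a multi-task deep learning model (FCDN) that needs no default boxes and performs semantic segmentation and object detection fully end to end, so that the detection model can raise detection speed while maintaining accuracy. Method We first show that the unpredictable number of objects to be detected is the reason current mainstream object detection algorithms must define default boxes in advance. Because current deep learning object detection algorithms are all extended from image classification models, the unknown number of objects makes it impossible to fix the output of the detection model, and to ensure recall, sufficiently dense and multi-scale default boxes must be classified. Object detection needs category information to recognize objects of different classes and boundary information to distinguish and localize individual objects. Semantic segmentation extracts rich category information, so object classes can be recognized from the semantic segmentation map. Following the idea of semantic segmentation, we also design a module that extracts the boundary key points of objects in the image; combining the semantic segmentation map with the boundary key point distribution map completes object recognition and localization. Result To verify the feasibility of object detection based on the idea of semantic segmentation, we trained the model, tested it on the VOC (visual object classes) 2007 test set, and compared its performance with current mainstream object detection algorithms. The results show that the new model performs semantic segmentation and object detection simultaneously and that, trained on the same training samples, its detection accuracy is better than that of classic object detection models. Its running time is 8 ms shorter than that of FCN and is close to that of fast detection algorithms such as YOLO (you only look once). Conclusion This paper proposes a new approach to object detection that no longer takes image classification as its basis and does not need to classify preset dense, multi-scale default boxes. The experimental results show that it is feasible to make full use of the rich information extracted by semantic segmentation and to complete object detection from the semantic segmentation map and boundary key points; the method avoids repeated detection of the image and the resulting computational waste. Detection efficiency is further improved by reducing the number of pixels predicted by semantic segmentation, and experiments verify that the simplified semantic segmentation result is still sufficient for object detection.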
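The combination of a semantic segmentation map with a boundary key point map can be illustrated with a small sketch. The code below is not the FCDN procedure: it assumes the segmentation map is a 2-D grid of class ids (0 for background) and the key point map is a grid of 0/1 flags, and it substitutes a plain flood fill for the paper's boundary-line step. The function name `detect_boxes` and both input layouts are hypothetical. Boundary pixels are absorbed into an object but do not propagate the fill, so two touching objects of the same class yield separate boxes.

```python
from collections import deque


def detect_boxes(seg, boundary):
    """Derive boxes from a class-id grid and a 0/1 boundary-flag grid.

    Hypothetical helper, not the paper's algorithm.
    Returns a list of (class_id, x_min, y_min, x_max, y_max).
    """
    h, w = len(seg), len(seg[0])
    seg = [row[:] for row in seg]  # copy, so detected pixels can be erased
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            cls = seg[y0][x0]
            if cls == 0 or boundary[y0][x0]:
                continue  # start only from interior pixels of an undetected object
            x_min = x_max = x0
            y_min = y_max = y0
            seg[y0][x0] = 0  # erase the pixel once it is claimed
            queue = deque([(y0, x0)])
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                if boundary[y][x]:
                    continue  # boundary pixels join the object but stop the growth
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and seg[ny][nx] == cls:
                        seg[ny][nx] = 0
                        queue.append((ny, nx))
            boxes.append((cls, x_min, y_min, x_max, y_max))
    return boxes
```

With an all-zero boundary grid, two touching regions of the same class collapse into one box, which is why category information alone cannot separate individual objects in this simplified picture.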
Keywords
Full convolutional network for semantic segmentation and object detection
Xiao Feng1, Rui Ting2, Ren Tongwei3, Wang Dong2
(1. School of Graduate, PLA Army Engineering University, Nanjing 210018, China; 2. School of Field Works, PLA Army Engineering University, Nanjing 210018, China; 3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210018, China)
Abstract
Objective Mainstream object detection algorithms must define default boxes in advance and obtain object boxes by filtering those default boxes. Sufficiently dense, multi-scale default boxes must be preset to ensure adequate recall, which leads to repeated detection of the same image regions and considerable computational waste. This study proposes a multi-task deep learning model (FCDN) that needs no default boxes and improves detection speed while maintaining accuracy. Method The number of objects in an image is not known in advance, and this is why current mainstream object detection algorithms must define default boxes beforehand. Deep learning object detection networks are developed from image classification models. Because the number of objects to be detected is unpredictable, the output of the detection model cannot be fixed, so sufficiently dense, multi-scale default boxes must be classified to ensure recall. Object detection requires category information to recognize different kinds of objects and boundary information to localize each object. A semantic segmentation map carries rich category information, which can be used to recognize object categories. Object recognition and localization can therefore be completed by adopting the idea of semantic segmentation, designing a module that extracts the boundary key points of objects, and combining the semantic segmentation map with those boundary key points. Object detection methods based on image classification have a rectangular receptive field that includes other objects or background in addition to the object itself. Methods based on the semantic segmentation map and boundary key points differ in that their receptive field is at the pixel level.
The pixels of a detected object can be removed from the semantic segmentation map and the boundary key point distribution map without affecting the detection of other objects, which also avoids leaving small objects undetected. On the basis of this analysis, we propose a new multi-task learning model that adds a boundary key point prediction layer to a semantic segmentation model, performs semantic segmentation and boundary key point prediction simultaneously, and combines the semantic segmentation map with the boundary key point distribution map to complete object detection. Boundary lines are obtained from the boundary key points, and object boxes are obtained from the boundary lines. Result An object detection network that needs no default boxes is proposed. The algorithm is no longer based on image classification; instead, it uses the idea of semantic segmentation to detect all object boundary key points at the pixel level. Object boxes are obtained by combining these key points with the category information in the semantic segmentation result. The model is trained and then tested on the PASCAL VOC 2007 test set to verify the feasibility of object detection based on semantic segmentation. Comparison with current mainstream object detection algorithms shows that the new model performs semantic segmentation and object detection simultaneously and that, when trained on the same training samples, FCDN achieves higher detection precision than classic detection models. Its running time is 8 ms shorter than that of FCN and is close to that of fast detection algorithms such as YOLO. Conclusion This study proposes an approach to object detection that is no longer based on image classification and instead uses semantic segmentation to extract information from the image.
Experimental results show that completing object detection from the semantic segmentation map and the boundary key points is feasible. The method avoids repeated detection and the resulting computational waste, and it improves detection efficiency by reducing the number of pixels predicted by semantic segmentation. Experiments confirm that the simplified semantic segmentation map remains sufficient for object detection.
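The effect of predicting fewer segmentation pixels can be made concrete with a short sketch. The abstract does not say how the pixel count is reduced, so the code assumes the simplest scheme, keeping every `stride`-th pixel of a class-id grid; the helper names `subsample`, `box_on_coarse_grid`, and `to_image_coords` are hypothetical. Under that assumption a stride of 2 cuts the number of predicted pixels by a factor of 4, while a box recovered on the coarse grid stays within `stride - 1` pixels of the full-resolution box on each side, provided the object is larger than the stride.

```python
def subsample(label_map, stride):
    """Keep every `stride`-th pixel in both directions (assumed reduction scheme)."""
    return [row[::stride] for row in label_map[::stride]]


def box_on_coarse_grid(coarse, cls):
    """Bounding box (x_min, y_min, x_max, y_max) of class `cls`, or None if absent."""
    coords = [(y, x)
              for y, row in enumerate(coarse)
              for x, value in enumerate(row)
              if value == cls]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def to_image_coords(box, stride):
    """Map a coarse-grid box back to full-resolution pixel coordinates.

    Each coarse cell stands for a stride x stride block, so the far edge
    of the box is extended to the end of its block.
    """
    x_min, y_min, x_max, y_max = box
    return (x_min * stride,
            y_min * stride,
            x_max * stride + stride - 1,
            y_max * stride + stride - 1)
```

The trade-off in this sketch is explicit: localization error grows linearly with the stride, whereas the number of pixels to predict falls quadratically.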
Keywords
deep learning; object detection; semantic segmentation; object boundary key points; multi-task learning; transfer learning; default boxes