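Deep learning methods for natural scene text detection and recognition
Abstract
Many natural scene images contain rich text, which plays an important role in scene understanding. With the rapid development of mobile internet technology, many new applications need to use this text information, such as signboard recognition and autonomous driving. The analysis and processing of scene text has therefore become one of the research hotspots in computer vision; the task mainly comprises text detection and text recognition. Traditional text detection and recognition methods rely on hand-designed features and rules, and their models are complex to design, inefficient, and generalize poorly. With the development of deep learning, scene text detection, scene text recognition, and end-to-end scene text detection and recognition have all achieved breakthrough progress, with significant improvements in both performance and efficiency. This paper introduces the research background of the field; organizes, classifies, and summarizes deep learning-based methods for scene text detection, recognition, and end-to-end detection and recognition; and describes the basic ideas, advantages, and disadvantages of each class of methods. For the methods in each category, it further discusses and analyzes the algorithmic pipelines, applicable scenarios, and technical development paths of the main models. In addition, it lists and describes some mainstream public datasets and compares the performance of the various models on representative datasets. Finally, it summarizes the limitations of current scene text detection, recognition, and end-to-end detection and recognition algorithms on data from different scenarios, as well as future challenges and development trends.
Keywords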
Deep learning methods for scene text detection and recognition
Liu Chongyu, Chen Xiaoxue, Luo Canjie, Jin Lianwen, Xue Yang, Liu Yuliang (School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510640, China)
Abstract
With the rapid development of internet and mobile internet technologies, many new applications, such as signboard recognition and autonomous driving, make extensive use of the rich text information in natural scenes. The analysis and processing of scene text therefore plays an essential role in these applications and has increasingly become one of the research hotspots in computer vision. Traditional text detection and recognition methods often rely on manually designed features, which makes them computationally expensive and inefficient. These methods also generalize poorly to complex scenes. With the development of deep learning in recent years, convolutional neural networks have made great progress on scene text detection and recognition. Deep learning-based methods outperform traditional ones by a large margin and have become the mainstream in the field of text reading in the wild. Scene text detection methods can be divided into two categories according to the object they predict: top-down methods and bottom-up methods. Top-down methods mainly inherit the basic idea of general object detection or instance segmentation and directly regress the entire bounding box of a text instance. In contrast, bottom-up methods follow the idea of traditional approaches: they first detect components of a text instance and then group them together according to a set of rules. Bottom-up methods are more effective than top-down methods at detecting text of arbitrary shapes and orientations, and they are less sensitive to text scale. However, grouping the detected components into text instances requires complex design and processing, which makes the inference stage of bottom-up approaches inefficient. These methods also encounter some difficulties when detecting long text. In addition, text conglutination occurs when detecting dense text. Top-down methods do not have this issue and can achieve higher precision for text detection.
In recent years, recognizing text in natural scenes, known as scene text recognition (STR), has aroused great interest in academia and industry. The objective of STR is to translate a cropped text instance image into a target string sequence. Although optical character recognition (OCR) in scanned documents is well developed, STR remains challenging because of factors such as very complex backgrounds, various fonts, and imperfect imaging conditions. Early work relied on hand-crafted features, such as histogram of oriented gradients descriptors, connected components, and the stroke width transform. However, the performance of these approaches is limited by the low representational capability of such features. With the rise of deep learning in recent years, the community has witnessed substantial advances. Scene text recognition approaches based on deep learning can be roughly divided into two branches: segmentation-based approaches and segmentation-free approaches. Segmentation-based approaches attempt to locate each character in the input text instance image, apply a character classifier to recognize each character, and then group the characters into text lines to obtain the final recognition result. Segmentation-free approaches recognize the text instance image as a whole and focus on mapping the entire image directly into a target string sequence. Both branches have their own advantages and limitations, so practitioners should choose the best trade-off for their application scenario. Although the practicality and efficiency of recognition approaches have improved significantly in recent decades, further research is still needed on the generalization ability, evaluation protocols, and application scenarios of STR.
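The final grouping step of a segmentation-based recognizer, as described above, can be illustrated with a minimal sketch. It assumes that characters have already been located and classified; the `CharBox` type, the `group_into_lines` function, and the `min_overlap` threshold are hypothetical names introduced only for illustration and do not correspond to any particular method covered by this survey.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharBox:
    """Hypothetical output of a character detector plus classifier."""
    x: float    # left edge of the character box
    y: float    # top edge of the character box
    w: float    # box width
    h: float    # box height
    label: str  # character predicted by the classifier


def group_into_lines(chars: List[CharBox], min_overlap: float = 0.5) -> List[str]:
    """Group classified characters into text lines and read each line left to right."""
    lines: List[List[CharBox]] = []
    # Visit characters from top to bottom so lines come out in reading order.
    for ch in sorted(chars, key=lambda c: c.y):
        for line in lines:
            ref = line[0]
            # Vertical overlap between this character and the line's first character.
            overlap = min(ch.y + ch.h, ref.y + ref.h) - max(ch.y, ref.y)
            if overlap >= min_overlap * min(ch.h, ref.h):
                line.append(ch)
                break
        else:
            # No existing line overlaps enough: start a new line.
            lines.append([ch])
    # Within each line, order characters by their left edge and join the labels.
    return ["".join(c.label for c in sorted(line, key=lambda c: c.x)) for line in lines]
```

The rule used here (vertical overlap with the first character of a line) is deliberately simple; real scenes with curved or rotated text would need a more careful grouping rule, which is one reason segmentation-free approaches are attractive.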
Finally, end-to-end scene text spotting aims to combine text detection and text recognition into a unified system that can be optimized in a single pipeline. Bridging the gap between the detection branch and the recognition branch is the most essential problem in the design of an end-to-end text spotting system. Similar to general object detection and instance segmentation, end-to-end text spotting methods can be divided into two categories: two-stage methods and one-stage methods. Two-stage methods are mainly based on Faster R-CNN (region-based convolutional neural network) and Mask R-CNN, in which region of interest (RoI) pooling/align acts as a bridge between the two branches. However, these operations may lose some information because the region proposals from the region proposal network (RPN) are not sufficiently accurate. One-stage methods follow the pipeline of detection followed by recognition, and various feature-alignment operations are carefully designed to strengthen the link between the detection and recognition branches. We sort out and summarize the detection and recognition methods for scene text, and further elaborate on and analyze the basic ideas of the various methods and their pros and cons. We aim to provide a reference for researchers and to support future work.
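To make the bridging role of RoI pooling mentioned above more concrete, the following is a minimal sketch of max pooling over an axis-aligned region of a single-channel feature map. The function name `roi_max_pool`, its argument layout, and the bin-splitting rule are assumptions chosen for illustration; they are not the implementation used in Faster R-CNN, Mask R-CNN, or any specific text spotter.

```python
from typing import List, Tuple


def roi_max_pool(
    feature_map: List[List[float]],
    roi: Tuple[int, int, int, int],
    out_h: int,
    out_w: int,
) -> List[List[float]]:
    """Max-pool the region roi = (x0, y0, x1, y1) to a fixed out_h x out_w grid.

    Coordinates are in feature-map cells; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    if h <= 0 or w <= 0:
        raise ValueError("RoI must cover at least one cell")
    pooled = []
    for i in range(out_h):
        # Row range of bin i; forced to contain at least one row.
        ya = y0 + (i * h) // out_h
        yb = max(y0 + -(-((i + 1) * h) // out_h), ya + 1)
        row = []
        for j in range(out_w):
            # Column range of bin j; forced to contain at least one column.
            xa = x0 + (j * w) // out_w
            xb = max(x0 + -(-((j + 1) * w) // out_w), xa + 1)
            # Keep only the maximum activation in the bin; other values are discarded.
            row.append(max(feature_map[y][x] for y in range(ya, yb) for x in range(xa, xb)))
        pooled.append(row)
    return pooled
```

Whatever the size of the proposed region, the output has a fixed shape that a recognition branch can consume. The sketch also shows where information can be lost: each bin keeps a single value, and a region that is slightly misplaced pools over the wrong cells.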
Keywords
scene text detection; scene text recognition (STR); end-to-end scene text spotting; deep learning; optical character recognition (OCR); survey