Arbitrary-shape scene text detection based on pixel aggregation and feature enhancement

Shi Guangchen, Wu Yirui (School of Computer and Information, Hohai University, Nanjing 211100, China)

Abstract
Objective Text can be seen everywhere, on street signs, billboards, newspapers, and many other items, and it carries much of the information a scene intends to convey. The quality of text detection determines the attainable level of text recognition and scene understanding. With the rapid development of computer vision, the Internet of Things, and other modern technologies, many emerging applications need to extract text information from images. In recent years, several new scene-text detection methods have been proposed. However, many of them detect slowly because of complex models and heavy post-processing, which limits their practical deployment. On the other hand, previous high-efficiency text detectors mainly predict quadrilateral bounding boxes, which makes it difficult to localize arbitrarily shaped scene text accurately.

Method In this paper, an efficient arbitrary-shape text detector called the non-local pixel aggregation network (non-local PAN) is proposed. Non-local PAN follows a segmentation-based approach to detect scene-text instances. To keep detection fast, the backbone must be lightweight, but the representation capability of lightweight backbones is usually weak. Therefore, a non-local module is added to the backbone to enhance its feature extraction; the non-local operation is an attention mechanism that captures the intrinsic relations among text pixels. ResNet-18 is used as the backbone of non-local PAN, and the non-local module is embedded before the last residual block of the third layer. In addition, a feature-vector fusion module is designed to fuse feature vectors of different levels and thereby strengthen the feature expression of scene texts at different scales. The module is formed by concatenating multiple feature-vector fusion blocks, whose core component is causal convolution; after training, each block predicts the fused feature vector from the previously input feature vectors. The method also uses a lightweight segmentation head that processes features effectively at a small computational cost. The head contains two key modules, the feature pyramid enhancement module (FPEM) and the feature fusion module (FFM). FPEM is cascadable and cheap to compute; attached behind the backbone, it enhances features at different scales and makes the network more expressive. FFM then merges the features produced by FPEMs at different depths into the final features for segmentation. Non-local PAN uses the predicted text region to describe the complete shape of a text instance and predicts text kernels to distinguish adjacent text instances. The network also predicts a similarity vector for each text pixel to guide the pixel to the correct kernel.

Result The proposed method is compared with other methods on three scene-text datasets and shows outstanding performance in both speed and accuracy. On the International Conference on Document Analysis and Recognition (ICDAR) 2015 dataset, its F value is 0.9% higher than that of the best previous method, with a detection speed of 23.1 frames/s. On the Curve Text in the Wild (CTW1500) dataset, its F value is 1.2% higher, with a detection speed of 71.8 frames/s. On the Total-Text dataset, its F value is 1.3% higher, with a detection speed of 34.3 frames/s, far exceeding the other methods.
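As a concrete illustration of the backbone modification described in the Method, the sketch below embeds a non-local block into ResNet-18 in PyTorch. It is a minimal sketch, not the paper's released code: we assume the embedded-Gaussian formulation of the non-local block (Wang et al., 2018), and we map the paper's "third layer" to torchvision's layer2 (the conv3_x stage); the channel-reduction factor and other hyperparameters are likewise our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: self-attention over all
    spatial positions, capturing pairwise relations between pixels."""
    def __init__(self, in_channels):
        super().__init__()
        inter = in_channels // 2                       # channel reduction
        self.theta = nn.Conv2d(in_channels, inter, 1)  # query projection
        self.phi = nn.Conv2d(in_channels, inter, 1)    # key projection
        self.g = nn.Conv2d(in_channels, inter, 1)      # value projection
        self.out = nn.Conv2d(inter, in_channels, 1)    # restore channels

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) pixel affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

backbone = resnet18(weights=None)
# Embed the non-local block before the last residual block of the
# conv3_x stage (torchvision's layer2, 128 output channels).
blocks = list(backbone.layer2.children())
backbone.layer2 = nn.Sequential(*blocks[:-1], NonLocalBlock(128), blocks[-1])
```

According to the ablation reported below, this embedding position is the one that yields the best accuracy among the tested layers.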
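The abstract specifies only that causal convolution is the core of the feature-vector fusion block and that fused vectors are predicted from previously input vectors. The sketch below is therefore an illustration of that idea rather than the paper's implementation: per-scale feature vectors are stacked along a sequence axis, and a left-padded 1-D convolution makes each fused vector depend only on the current and earlier vectors. The block structure and all names and sizes (CausalFusionBlock, 128 channels, four levels) are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalFusionBlock(nn.Module):
    """Illustrative fusion block: a 1-D causal convolution over a
    sequence of per-scale feature vectors, so the fused vector at
    position t depends only on vectors input at positions <= t."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                    # left padding => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.act = nn.ReLU()

    def forward(self, x):                             # x: (B, C, num_scales)
        x = F.pad(x, (self.pad, 0))                   # pad the left side only
        return self.act(self.conv(x))

# Hypothetical usage: fuse four 128-D feature vectors, one per pyramid level,
# by concatenating two fusion blocks (the module is described as a
# concatenation of multiple feature-vector fusion blocks).
vectors = torch.randn(2, 128, 4)                      # (batch, channels, levels)
fusion = nn.Sequential(CausalFusionBlock(128), CausalFusionBlock(128))
fused = fusion(vectors)                               # (2, 128, 4)
```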
In addition, parameter-setting experiments are designed to explore the best position for embedding the non-local module. The experiments show that embedding the non-local module outperforms the same network without it, indicating that the module plays an active role in the detection process. In terms of detection accuracy, embedding non-local blocks into the second, third, or fourth layer of ResNet-18 brings a clear improvement, whereas embedding into the fifth layer does not; embedding into the third layer performs best. Ablation experiments on the ICDAR 2015 dataset for the non-local module and the feature-vector fusion module show that the advantage of the non-local module comes not from deepening the network but from its own structural characteristics. The feature-vector fusion module also contributes to scene-text detection: it combines feature maps of different scales to strengthen the feature expression of scene texts with variable scales.

Conclusion This paper proposes an efficient detection method for arbitrary-shape scene text that balances accuracy and real-time performance. The experimental results show that the proposed model outperforms previous methods and is superior in both accuracy and speed.