Hierarchical semantics-fused scene text detection

Wang Zixiao; Xie Hongtao; Wang Yuxin; Zhang Yongdong

Published: 2023-08-17
DOI: 10.11834/jig.220902
2023 | Volume 28 | Number 8

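Wang Zixiao, Xie Hongtao, Wang Yuxin, Zhang Yongdong (School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China)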

Abstract

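Objective Scene text detection is an important task in scene understanding and text recognition. Although deep learning based algorithms have markedly improved detection accuracy, existing methods extract neither the local semantics within a text instance nor the global semantics across text instances well enough. They therefore fail to model the multi-level semantics of text, and detection accuracy suffers. To address this problem, a scene text detection algorithm based on hierarchical semantic fusion is proposed. Method The method comprises a text-segment-based local semantic understanding module and a text-instance-based global semantic understanding module, which guide the network to attend to the multi-level semantic information within text and across text instances, respectively. First, the local semantic understanding module divides each text into multiple segments according to relative position and, under the supervision of this fine-grained optimization target, strengthens the network's perception of local semantics. Then, the global semantic understanding module uses the coarse segmentation of the text segments to filter out background regions and extract reliable text-region features, and applies an attention mechanism to adaptively capture the global semantics of arbitrarily shaped text and produce the final segmentation. In addition, to reduce the interference of noisy boundary predictions with the aggregation of hierarchical semantic information, a boundary-aware loss function is proposed to reduce the ambiguity of features in boundary regions. Results The algorithm was evaluated on three widely used scene text detection datasets and compared with other algorithms, and it achieved a significant performance improvement. On Total-Text, the F-measure is 87.0%, 1.0% higher than other models; on MSRA-TD500 (MSRA text detection 500 database), the F-measure is 88.2%, 1.0% higher than other models; on ICDAR 2015 (International Conference on Document Analysis and Recognition), the F-measure is 87.0%. Conclusion By building the semantic context at each level separately and adding an extra penalty on ambiguous features, the proposed model resolves the insufficient extraction of hierarchical semantics and achieves higher detection accuracy.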
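As a rough illustration of dividing a text instance into segments by relative position, the sketch below labels each pixel column of an axis-aligned text box with the index of the segment it falls into. The function name `segment_labels`, the equal-width split, the horizontal reading direction, and the choice of three segments are all assumptions made for this example; the abstract does not say how the segments are defined, in particular for curved text.

```python
def segment_labels(x_min, x_max, num_segments=3):
    """Map each integer column in [x_min, x_max] to a segment index.

    Hypothetical helper: the box is split into `num_segments` parts of
    (near) equal width along an assumed horizontal reading direction.
    """
    if x_max < x_min:
        raise ValueError("x_max must be >= x_min")
    if num_segments < 1:
        raise ValueError("num_segments must be >= 1")
    width = x_max - x_min + 1
    labels = {}
    for x in range(x_min, x_max + 1):
        # Integer form of floor(relative_position * num_segments),
        # where relative_position = (x - x_min) / width lies in [0, 1).
        labels[x] = (x - x_min) * num_segments // width
    return labels


# A box spanning columns 10..18 split into three segments of three columns.
labels = segment_labels(10, 18, num_segments=3)
print(labels)
```

Labels of this kind could serve as the fine-grained supervision target for an auxiliary segmentation head, one class per segment.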

Keywords

scene text; text detection; fully convolutional network (FCN); convolutional neural network (CNN); feature fusion; attention mechanism

Hierarchical semantics-fused scene text detection

Wang Zixiao, Xie Hongtao, Wang Yuxin, Zhang Yongdong (School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China)

Abstract

Objective Scene text detection, which aims to localize the text instances in an image, is an essential computer vision task. It benefits text recognition applications such as scene understanding, translation, and text visual question answering. With the rise of deep learning, convolutional neural networks (CNNs) have been widely applied to text detection. Many existing methods locate text by regressing a quadrangular bounding box. However, because regression based methods do not fit text with arbitrary shapes (e.g., curved text), many approaches have turned to segmentation based methods. These commonly use fully convolutional networks (FCNs) to obtain high-resolution feature maps and predict a pixel-level mask to locate the text instances. Because text instances have extreme aspect ratios and widely varying sizes, it is difficult for existing models to integrate local-level and global-level semantics in a single feature map. Some methods therefore introduce feature maps from multiple levels of the network and derive hierarchical semantics from the corresponding feature maps. However, such modules require the network to optimize the hierarchical features simultaneously, which may distract the network to a certain extent. Hence, existing networks still fall short of capturing accurate hierarchical semantics. Method To resolve this problem, we propose a segmentation based text detection method, the hierarchical semantic fusion network. We decouple the local and global feature extraction processes and learn the corresponding semantics separately. Specifically, two mutually beneficial components are introduced to enhance the local and global features: the sub-region based local semantic understanding module (SLM) and the instance based global semantic understanding module (IGM).
First, SLM divides each text instance into a kernel and multiple sub-regions according to their positions within the text, and the network learns to segment them as an auxiliary task. Because a sub-region is only a small part of the text, segmenting it requires more local-level information and less long-range context, which pushes the model to learn more accurate local features. Furthermore, the position information supervised by the ground truth helps the network separate adjacent text instances. Second, IGM extracts global contextual features by capturing the long-range dependencies among text instances. Using the segmentation maps from SLM, IGM can easily filter out the noisy background and obtain instance-level features for each text instance. These features are then fed into a Transformer that fuses the semantics of different instances, producing text features with a global receptive field. Next, the similarity between these text features and the original pixel-level feature map is calculated. Finally, the global-level feature is obtained by aggregating the text features according to the similarity map. Together, SLM and IGM let the network learn to segment text progressively, from pixels to local regions and then to text instances. During this procedure, the semantics of each level are collected in the corresponding module, which explicitly reduces the distraction from the other levels. In addition, ambiguous boundaries in the segmentation results carry vague semantics, which may distort the semantic extraction. To alleviate this problem, we propose a location aware loss (LAL) that increases the penalty for misclassification around the border region. LAL is computed as a weighted loss in which pixels closer to the boundary are assigned higher weights.
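The idea of weighting the loss more heavily near the text boundary can be sketched as follows. This is a minimal illustration and not the authors' formulation: the abstract only says that pixels closer to the boundary receive higher weights, so the weight `1 + alpha * exp(-d / sigma)`, the 4-connected step distance `d`, the binary cross-entropy base loss, and the names `boundary_distance` and `boundary_weighted_bce` are all assumptions.

```python
import math
from collections import deque


def boundary_distance(mask):
    """For every pixel, the 4-connected step distance to the nearest boundary
    pixel (a pixel that has a 4-neighbour belonging to the other class)."""
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    for y in range(h):
        for x in range(w):
            for dy, dx in steps:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] != mask[y][x]:
                    dist[y][x] = 0
                    queue.append((y, x))
                    break
    # Multi-source breadth-first search outward from the boundary pixels.
    while queue:
        y, x = queue.popleft()
        for dy, dx in steps:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def boundary_weighted_bce(pred, mask, alpha=2.0, sigma=1.0, eps=1e-7):
    """Weighted binary cross-entropy with weight 1 + alpha * exp(-d / sigma).

    The exponential weighting is an assumption made for illustration.
    """
    dist = boundary_distance(mask)
    total, weight_sum = 0.0, 0.0
    for y, row in enumerate(mask):
        for x, target in enumerate(row):
            d = dist[y][x]
            # A mask with no boundary at all falls back to uniform weights.
            weight = 1.0 if d is None else 1.0 + alpha * math.exp(-d / sigma)
            p = min(max(pred[y][x], eps), 1.0 - eps)
            bce = -(target * math.log(p) + (1 - target) * math.log(1.0 - p))
            total += weight * bce
            weight_sum += weight
    return total / weight_sum


# One row of pixels: background on the left, text on the right.
mask = [[0, 0, 0, 1, 1, 1]]
pred_near = [[0.1, 0.1, 0.9, 0.9, 0.9, 0.9]]  # one error next to the boundary
pred_far = [[0.9, 0.1, 0.1, 0.9, 0.9, 0.9]]   # the same error far from it
loss_near = boundary_weighted_bce(pred_near, mask)
loss_far = boundary_weighted_bce(pred_far, mask)
print(loss_near > loss_far)
```

With these weights, the same misclassification costs more when it sits next to the boundary than when it sits far from it, which is the behaviour the loss is described as encouraging.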
This loss function pushes the model to make confident and accurate predictions around the boundary, which yields more accurate and discriminative features. Result We compare our method with 12 popular methods on three challenging datasets: Total-Text, MSRA-TD500, and ICDAR2015. The quantitative evaluation metrics are F-measure, recall, and precision. On Total-Text and MSRA-TD500, our method achieves an improvement of over 1%, with F-measures of 87.0% and 88.2%, respectively. In particular, recall and precision on MSRA-TD500 reach 92.1% and 84.5%, respectively. On ICDAR2015, the precision is improved to 92.3% and the F-measure reaches 87.0%. Additionally, a series of comparative experiments on Total-Text evaluates the effectiveness of each proposed module. They show that SLM, IGM, and LAL improve the F-measure by 1.0%, 0.6%, and 0.5%, respectively. Qualitative visualizations show that the proposed modules improve on the baseline model to a certain extent. Conclusion We develop a hierarchical semantic understanding network and a novel loss function to enhance hierarchical semantics. Decoupling the local and global feature extraction processes proves to be an effective way to progressively obtain more accurate and reliable hierarchical semantics.
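The reported figures can be sanity-checked under the usual definition of the F-measure as the harmonic mean of precision and recall. That definition is an assumption here, since the abstract does not state it, and `f_measure` is a name chosen for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MSRA-TD500 precision and recall as quoted in the abstract.
f_msra = f_measure(0.845, 0.921)
print(f"{f_msra:.3f}")  # prints 0.881
```

The result, about 88.1%, is close to the reported 88.2%; the small gap is plausibly due to rounding of the published precision and recall.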

Keywords

scene text; text detection; fully convolutional network (FCN); convolutional neural network (CNN); feature fusion; attention mechanism
