  • Published: 2025-03-10
Offline handwritten mathematical expression recognition: a survey

Zhu Jianhua1, Gao Liangcai1, Zhao Wenqi1, Peng Shuai1, Hu Pengfei1, Du Jun2 (1. Wangxuan Institute of Computer Technology, Peking University; 2. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China)

Abstract
With the rapid development of online learning and communication, handwritten mathematical expressions are used ever more widely in fields such as education and science. Accurately recognizing such expressions and converting them into structured formats like MathML or LaTeX, a task known as handwritten mathematical expression recognition (HMER), has become an important problem in text recognition. Owing to diverse handwriting styles and the nested two-dimensional structure of mathematical expressions, the problem remains highly challenging. Current research on HMER is primarily divided into two categories: traditional grammar-based methods and deep learning-based methods. The latter have gained significant traction because they model the problem end to end using encoder-decoder architectures, and they typically involve three key steps: feature extraction, alignment, and regression, each with its own difficulties.

Feature extraction is the process by which the encoder extracts high-level semantic information from the input image. Its primary challenge is learning semantically invariant features. Because writing habits differ, the same symbol can be written in vastly different styles by different individuals, and pairs such as "x" and "X" or "C" and "c" can look very similar in handwriting, making them hard to distinguish. The same symbol may also vary in size depending on its position within the expression: a comma or period in a superscript or subscript is much smaller than the same symbol in the main body.
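The size-variance issue above is one motivation for extracting features at more than one scale. The following is only a toy NumPy sketch of the idea (real systems such as DenseWAP use convolutional branches, not fixed pooling): a max branch keeps thin strokes and small symbols visible, while a mean branch summarizes coarse layout.

```python
import numpy as np

def pool2d(x, k, op):
    """Pool a 2-D map with a k x k window using the given reduction
    (np.max or np.mean); H and W must be divisible by k."""
    h, w = x.shape
    return op(x.reshape(h // k, k, w // k, k), axis=(1, 3))

def two_scale_features(img, k=2):
    """Toy two-branch extractor: the max branch preserves small, thin
    marks (e.g. a superscript comma), the mean branch captures layout."""
    fine = pool2d(img, k, np.max)
    coarse = pool2d(img, k, np.mean)
    return np.stack([fine, coarse])  # shape (2, H/k, W/k)

img = np.arange(16.0).reshape(4, 4)
feats = two_scale_features(img)
print(feats.shape)  # (2, 2, 2)
```

In an actual encoder, the two branches would be feature maps taken from different network depths and fused before attention, rather than fixed pooling operators.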
The vision encoder must therefore learn features that are invariant to such differences in style, size, and shape, so that symbols are recognized correctly regardless of how or where they are written.

Alignment is the process by which the decoder, using attention mechanisms, progressively matches the extracted visual features with the corresponding output tokens. Its key challenge is the "lack of coverage" problem. Ideally, the model converts each symbol and structure in the image into output text exactly once, neither repeating nor omitting any part. In practice, however, models suffer from over-parsing (repeating parts of the expression) or under-parsing (omitting parts), because the decoder lacks global alignment information and cannot tell which regions of the image have already been processed. To address this, some models maintain a coverage vector that accumulates the attention distributions of previous time steps, discouraging repeated attention to the same region and thereby mitigating both over- and under-parsing.

Regression is the process by which the decoder generates the output sequence autoregressively. It involves two main challenges: output imbalance and modeling two-dimensional structural relationships. Output imbalance arises because models typically predict the sequence from left to right (L2R), which yields more accurate predictions for the prefix than for the suffix, since the model's ability to capture dependencies between symbols diminishes as the distance between them grows.
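The coverage idea can be illustrated in a few lines. This is a deliberate simplification that assumes a fixed scalar penalty; in WAP-style models the accumulated attention instead passes through a learned convolution inside the score function.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend_with_coverage(scores, coverage, penalty=4.0):
    """One decoding step: regions with high cumulative attention
    (`coverage`) are penalized, discouraging over-parsing."""
    alpha = softmax(scores - penalty * coverage)
    return alpha, coverage + alpha  # weights for this step, updated coverage

scores = np.array([3.0, 1.0, 0.5])  # region 0 always looks most salient
cov = np.zeros(3)
a1, cov = attend_with_coverage(scores, cov)  # first step: region 0
a2, cov = attend_with_coverage(scores, cov)  # coverage pushes focus onward
print(int(a1.argmax()), int(a2.argmax()))  # 0 1
```

Without the penalty term, both steps would lock onto region 0, which is exactly the over-parsing failure mode described above.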
To address output imbalance, some models adopt bidirectional training, in which the model learns to predict the sequence in both L2R and right-to-left (R2L) directions, balancing the accuracy of prefix and suffix predictions. In addition, mathematical expressions often have a nested two-dimensional structure that is difficult to capture with purely sequential decoding. Some models therefore use tree-structured decoders that explicitly model the hierarchical relationships between symbols, although these require more complex training procedures and do not always outperform sequence-based approaches.

Building on this decomposition into feature extraction, alignment, and regression and the challenges of each step, this survey reviews the improvements prior work has made on each challenge. In feature extraction, models such as DenseWAP and ABM introduce multi-scale feature extraction to better capture fine-grained details in the input image. In alignment, models such as WAP and CoMER introduce coverage vectors and attention refinement modules to address the lack-of-coverage problem. In regression, models such as BTTR and ABM employ bidirectional training to improve the modeling of long-range dependencies. The survey then reports the performance of several models on printed-expression datasets. These experiments show that the main bottleneck of existing models is the diversity of handwriting: models perform well on printed data, where symbol style and size are consistent, but their accuracy drops significantly on handwritten data.
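At its core, a bidirectional objective of the BTTR/ABM kind averages two sequence losses, one on the target read left to right and one on the reversed target. The sketch below shows just that core, omitting the shared decoder, start/end tokens, and other details of the actual methods.

```python
import numpy as np

def seq_nll(log_probs, target):
    """Negative log-likelihood of a token sequence; log_probs[t] is the
    model's log-distribution over the vocabulary at decoding step t."""
    return -sum(log_probs[t][tok] for t, tok in enumerate(target))

def bidirectional_loss(l2r_log_probs, r2l_log_probs, target):
    """Average an L2R loss on the target with an R2L loss on the reversed
    target, so the suffix is supervised as directly as the prefix."""
    return 0.5 * (seq_nll(l2r_log_probs, target)
                  + seq_nll(r2l_log_probs, target[::-1]))

# Toy outputs over a 3-token vocabulary for the 2-token target [0, 2].
lp = np.log(np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]]))
loss = bidirectional_loss(lp, lp, [0, 2])
```

Here the same toy distribution serves as both directions' output, so the large R2L term reflects how poorly a purely L2R-trained model scores the reversed reading; training on both terms supervises the suffix as directly as the prefix.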
The gap between printed and handwritten performance highlights the need for further research into robustness across writing styles. Beyond traditional and deep learning-based methods, this survey also evaluates large multimodal models (LMMs) such as GPT-4V and Qwen-VL-Max on handwritten mathematical expression recognition datasets. Although these models, trained on vast amounts of multimodal data, perform impressively on many vision-and-language tasks, their performance on HMER remains well below that of specialized models, likely because handwritten mathematical expressions are scarce in their training data and the complex two-dimensional structure of expressions is hard for them to capture. The results suggest that LMMs have potential for HMER but leave significant room for improvement.

Finally, this survey points out future directions for the field. One promising direction is integrating tree-structured prediction with sequence-based decoding: tree-structured decoders model the hierarchy of an expression more interpretably, but they are harder to train and do not always outperform sequence-based methods, so future work could combine the strengths of both, for example by pre-training the decoder on large-scale LaTeX corpora and fine-tuning it on handwritten data. Other directions include more robust data augmentation to handle diverse writing styles, and larger, more varied datasets that include multi-line expressions and mixed handwritten and printed text.
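Comparisons like those above are usually scored with the expression recognition rate (ExpRate), the fraction of expressions whose predicted LaTeX matches the ground truth exactly. The normalization below is a minimal assumption for illustration (whitespace and single-token script braces only); CROHME tooling and individual papers normalize differently.

```python
import re

def normalize(latex):
    """Light canonicalization before comparison: drop whitespace and
    redundant braces around single-token superscripts/subscripts,
    so "x ^ {2}" and "x^2" compare equal."""
    s = re.sub(r"\s+", "", latex)
    s = re.sub(r"([_^])\{(\\?[A-Za-z0-9])\}", r"\1\2", s)
    return s

def exp_rate(preds, refs):
    """ExpRate: exact-match accuracy at the whole-expression level."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

preds = ["x ^ {2} + 1", "\\frac{a}{b}", "y_{i}"]
refs  = ["x^2+1",       "\\frac{a}{b}", "y_{j}"]
print(exp_rate(preds, refs))  # 2 of 3 exactly right
```

The all-or-nothing nature of ExpRate is why the printed-vs-handwritten gap discussed above is so stark: a single mis-read symbol anywhere in the expression counts the whole prediction as wrong.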
In conclusion, while significant progress has been made in the field of handwritten mathematical expression recognition, there are still many challenges to overcome. By addressing these challenges and exploring new directions, researchers can continue to improve the accuracy and robustness of HMER models, making them more useful in real-world applications such as education, document digitization, and scientific research.
Keywords
