SCID: a scanned Chinese invoice dataset for information extraction from visually-rich document images
Abstract
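Objective Visually-rich document information extraction aims to extract key text from input document images in structured form in order to solve practical business problems, and financial invoices are one common data type in this setting. Solving such problems usually requires techniques from several fields, including optical character recognition (OCR) and information extraction. However, few related datasets are publicly available, and each contains relatively few images, which has become an important factor limiting progress in this field. To this end, this paper collects, annotates, and publicly releases SCID (scanned Chinese invoice dataset), a dataset of real scanned Chinese invoices that covers 6 common types of financial invoices and contains 40 716 images in total. Method The dataset provides two kinds of labels, one for the OCR task and one for information extraction. For this dataset, this paper proposes a baseline scheme based on LayoutLM v2 (layout language model v2) that performs end-to-end inference from the input image to the final result. The CSIG (China Society of Image and Graphics) 2022 invoice recognition and analysis challenge, hosted on this dataset, attracted a large number of researchers, who proposed excellent solutions. Result In the baseline experiments, three settings were evaluated: inference with an OCR engine, a fine-tuned OCR model, and OCR ground truth. The F1 scores are 0.768 7, 0.857 0, and 0.985 7, respectively, which demonstrates the effectiveness of the LayoutLM v2 model on the one hand and the difficulty of OCR in this scenario on the other. Conclusion The scanned invoice dataset SCID proposed in this paper exhibits multiple challenges of real OCR application scenarios and can provide important data support for the research, development, and deployment of technologies related to visually-rich document information extraction. The dataset can be downloaded from https://davar-lab.github.io/dataset/scid.html.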
Keywords
SCID: a scanned Chinese invoice dataset for key information extraction from visually-rich document images
Qiao Liang1,2, Li Zaisheng2, Cheng Zhanzhan2, Li Xi1 (1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China; 2. Hikvision Research Institute, Hangzhou 310051, China)
Abstract
Objective Visually-rich document information extraction aims to extract key text information from document images in structured form. Invoices are one of the most common document types. In enterprise reimbursement processes in particular, there is strong demand for key information extraction from invoices. Solving this problem requires techniques such as optical character recognition (OCR) and information extraction. However, few related datasets are publicly available, and each contains relatively few images. Method We collect, annotate, and release a dataset of real scanned Chinese financial invoices. The dataset consists of 40 716 images covering six types of invoices: aircraft itinerary tickets, taxi invoices, general quota invoices, passenger invoices, train tickets, and toll invoices. It is divided into training/validation/testing sets of 19 999/10 358/10 359 images. The labeling process includes pseudo-label generation, manual rechecking and cleaning, and manual desensitization, and it yields two kinds of labels, one for the OCR task and one for information extraction. The dataset retains real-world challenges such as print misalignment, blurring, and overlap. We provide a baseline scheme that performs end-to-end inference. The overall solution consists of four steps: 1) an OCR module predicts the content and location of all text instances; 2) a text block ordering module rearranges the text instances into a more reasonable order, serializing the 2D information into 1D; 3) the LayoutLM v2 model fuses information from three modalities (text, visual, and layout) and predicts sequence labels, drawing on knowledge from the pre-trained language model; 4) a post-processing module converts the model's output into the final structured information. Because a single pipeline handles multiple invoice types, the solution reduces the complexity of the overall invoice system. Result The baseline is evaluated under three settings: inference with an OCR engine, a fine-tuned OCR model, and OCR ground truth, giving F1 scores of 0.768 7/0.857 0/0.985 7, respectively. These results demonstrate the effectiveness of the overall solution and the LayoutLM v2 model, and they also reflect how challenging OCR is in this scenario. On a Tesla V100 GPU, the inference speed of the model reaches 1.88 frame/s. An accuracy of 90% can be reached using only the raw image as input. The best-performing solutions fall roughly into two categories. The first merges the structuring task directly into text detection (i.e., multi-category detection), so that the recognition model only needs to recognize text of the categories of interest. The second follows the general strategy, in which an independent information extraction model extracts the key information. These solutions further exploit the potential of OCR and information extraction technologies. Conclusion The proposed scanned Chinese invoice dataset (SCID) reflects the challenges of real application scenarios of OCR technology and can provide data support for the research, development, and deployment of visually-rich document information extraction technology. The dataset can be downloaded from https://davar-lab.github.io/dataset/scid.html.
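Steps 2) and 4) of the baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the row-grouping heuristic, the tolerance `tol`, the `TextBox` type, the BIO tagging scheme, and the field names `train_no` and `fare` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBox:
    """Hypothetical OCR output: recognized text plus box position."""
    text: str
    x: float  # left edge
    y: float  # top edge
    h: float  # box height


def reading_order(boxes: List[TextBox], tol: float = 0.5) -> List[TextBox]:
    """Serialize 2D boxes into 1D: group into rows, then sort rows left to right."""
    rows: List[List[TextBox]] = []
    for box in sorted(boxes, key=lambda b: b.y + b.h / 2):
        center = box.y + box.h / 2
        if rows:
            ref = rows[-1][0]  # first box of the current row
            if abs(center - (ref.y + ref.h / 2)) <= tol * ref.h:
                rows[-1].append(box)
                continue
        rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b.x)]


def decode_bio(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Merge B-/I- tagged tokens into fields; the first span per field wins."""
    spans: List[Tuple[str, List[str]]] = []
    open_field = None  # field of the span currently being extended
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            open_field = tag[2:]
            spans.append((open_field, [token]))
        elif tag.startswith("I-") and tag[2:] == open_field:
            spans[-1][1].append(token)
        else:
            open_field = None  # "O" or a stray I- tag closes the span
    fields: Dict[str, str] = {}
    for name, parts in spans:
        fields.setdefault(name, "".join(parts))
    return fields


if __name__ == "__main__":
    boxes = [
        TextBox("553.0", x=60, y=38, h=10),
        TextBox("Train", x=10, y=10, h=10),
        TextBox("Fare", x=10, y=40, h=10),
        TextBox("G123", x=60, y=12, h=10),
    ]
    ordered = [b.text for b in reading_order(boxes)]
    tags = ["O", "B-train_no", "O", "B-fare"]  # stand-in for model predictions
    print(ordered)
    print(decode_bio(ordered, tags))
```

In a full pipeline the tags would come from the sequence labeling model in step 3); here they are hard-coded so that the ordering and decoding logic can be read in isolation.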
Keywords
dataset; financial invoices; visually-rich documents; information extraction; optical character recognition (OCR); multi-modal information