Text-Graphics Segmentation and Classification Algorithm for Text Page Images

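Wang Jiajun (王加俊)1, Huang Xianwu (黄贤武)1, Guo Weiwei (郭玮玮)1, Zhong Xingrong (仲兴荣)1 (School of Electronics and Information Engineering, Soochow University, Suzhou 215021)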

Abstract
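To segment and classify the text and graphics in skewed text page images that contain irregular picture regions and tables, a new segmentation and classification algorithm is proposed. The algorithm first detects and corrects the text skew using mathematical morphology and a hierarchical Hough transform. Then, so that page images containing irregular picture regions can be processed, a midpoint cut step is introduced into the traditional projection profile cut algorithm, allowing an irregular picture region to be approximated by a series of rectangles. For the segmented regions, two features serve as the classification criteria: the black-to-white pixel ratio (Rbw) and the cross-correlation between neighboring pixels (Rcc). Experimental results show that the algorithm is fast and reliable. The algorithm applies only to binary images.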
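As a rough illustration of the two classification features Rbw and Rcc, the following Python sketch computes them for one binary sub-block. The abstract does not give the exact formulas, so both definitions here are assumptions: Rbw is taken as the count of black pixels divided by the count of white pixels, and Rcc as the Pearson correlation between each pixel and its right-hand neighbour. The function names `black_white_ratio` and `adjacent_correlation` are hypothetical.

```python
from math import sqrt


def black_white_ratio(block):
    """Assumed Rbw: number of black (1) pixels over number of white (0) pixels."""
    black = sum(sum(row) for row in block)
    total = sum(len(row) for row in block)
    white = total - black
    # A block with no white pixels has an unbounded ratio.
    return float("inf") if white == 0 else black / white


def adjacent_correlation(block):
    """Assumed Rcc: Pearson correlation of each pixel with its right neighbour."""
    xs, ys = [], []
    for row in block:
        xs.extend(row[:-1])  # every pixel except the last in the row
        ys.extend(row[1:])   # the pixel immediately to its right
    n = len(xs)
    if n == 0:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant block: correlation is undefined, report 0
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    solid = [[1, 1, 0, 0], [1, 1, 0, 0]]      # long runs, picture-like
    checker = [[1, 0, 1, 0], [0, 1, 0, 1]]    # rapid alternation
    print(black_white_ratio(solid), adjacent_correlation(solid))
    print(black_white_ratio(checker), adjacent_correlation(checker))
```

Under these assumed definitions, blocks with long runs of identical pixels give a higher correlation than blocks with rapid alternation; how the two features are thresholded to separate text, pictures, and tables is not stated in the abstract and is left open here.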
Keywords
Page Segmentation and Classification Algorithm for Document Images


Abstract
In this paper, a system for the segmentation and classification of skewed document images containing irregular graphic regions and form regions is proposed. The skew angle of the document image is detected with a novel algorithm based on the morphological Hit-or-Miss operation and the hierarchical Hough transform. The Hit-or-Miss operation detects the baseline points, while the Hough transform detects the skew angle of the baselines, which is also the skew angle of the page image. To make the system applicable to document images with irregular graphic regions, we propose introducing a middle point cut process into the traditional projection profile cut algorithm, so that irregular graphic regions can be approximated by many small rectangles. The segmented regions are classified using two features of the sub-blocks: the black-to-white pixel ratio and the cross-correlation between adjacent pixels. Experimental results demonstrate the speed and reliability of the proposed system.
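The skew detection step can be pictured with a small coarse-to-fine Hough search over already extracted baseline points. This is only a sketch under stated assumptions, not the authors' implementation: the Hit-or-Miss stage is skipped (points are given directly), the two-level search stands in for the hierarchical Hough transform, and the function names, angle range, step sizes, and the sum-of-squares sharpness score are all hypothetical choices.

```python
import math
from collections import Counter


def hough_score(points, angle_deg, rho_bin=1.0):
    """Sharpness of the rho histogram for lines tilted by angle_deg."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    # For a line y = x*tan(t) + b, rho = y*cos(t) - x*sin(t) is constant,
    # so collinear points fall into a single histogram bin at the true angle.
    bins = Counter(round((y * c - x * s) / rho_bin) for x, y in points)
    return sum(n * n for n in bins.values())


def estimate_skew(points, max_angle=15, coarse_step=1.0, fine_step=0.1):
    """Coarse-to-fine search for the angle giving the sharpest histogram."""
    n_coarse = int(round(max_angle / coarse_step))
    coarse = [k * coarse_step for k in range(-n_coarse, n_coarse + 1)]
    best = max(coarse, key=lambda a: hough_score(points, a))
    # Refine only around the best coarse angle (assumed second level).
    n_fine = int(round(coarse_step / fine_step))
    fine = [best + k * fine_step for k in range(-n_fine, n_fine + 1)]
    return max(fine, key=lambda a: hough_score(points, a))


if __name__ == "__main__":
    # Two synthetic text baselines tilted by 3 degrees.
    slope = math.tan(math.radians(3.0))
    pts = [(x, x * slope + b) for b in (10.0, 40.0) for x in range(0, 500, 10)]
    print(estimate_skew(pts))
```

Searching a coarse grid first and refining only near its best angle keeps the number of evaluated angles small, which is the usual motivation for a hierarchical search. The sign of the returned angle depends on whether the y axis points up or down in the image coordinate system.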
Keywords
