Mapping relationship retrieval of suffixed printed Uyghur words

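Eksan Firkat1, Abdusalam Dawut1, Askar Hamdulla2 (1. School of Software, Xinjiang University, Urumqi 830046; 2. School of Information Science and Engineering, Xinjiang University, Urumqi 830046)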

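Abstract
Objective Uyghur is an agglutinative language in which words are formed by attaching affixes to a stem to express different meanings. When an affix is attached, the tail of the stem changes shape to some extent, and phonetic changes such as weakening, deletion, and epenthesis may cause further morphological change. As a result, current word spotting techniques can retrieve only one specific Uyghur word form; they cannot take a stem as the query and retrieve the suffixed words built on it. This paper therefore proposes a mapping-relationship-based retrieval technique for suffixed printed Uyghur words. Method First, local features are extracted from the Uyghur word images. Next, the features are matched bidirectionally with the fast library for approximate nearest neighbors (FLANN) to obtain a feature matching set. Finally, the matching set is projected onto the candidate Uyghur word image through homography and perspective transformation, which converts the matching set into a spatial relationship; mapping matching on this spatial relationship then identifies suffixed words, meeting the need for suffixed-word retrieval in printed Uyghur images. Result The experimental data comprise 17 648 segmented word images from 190 printed Uyghur text images; suffix retrieval was performed for 30 word images that have 167 corresponding suffixed word images. Different local feature algorithms were compared. The scale-invariant feature transform (SIFT) algorithm outperformed the speeded up robust features (SURF) algorithm, reaching a precision of 94.23% and a recall of 88.02%. Suffixed words formed from a stem can thus be retrieved efficiently in printed document images, which satisfies differing user retrieval needs and is broadly applicable. Comparative experiments were also run for weakening, deletion, epenthesis, multiple simultaneous phonetic changes, and changes at the tail of the stem; retrieval performed best for the morphological changes caused by weakening and by stem-tail changes. Conclusion The proposed method for retrieving suffixed word images based on mapping relationships is the first implementation of suffixed-word retrieval for Uyghur; by exploiting the spatial relationship between matching sets, it retrieves suffixed Uyghur word images efficiently.
Keywords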
Mapping relationship retrieval of Uyghur-printed suffix word

Eksan Firkat1, Abdusalam Dawut1, Askar Hamdulla2(1.School of Software, Xinjiang University, Urumqi 830046, China;2.School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)

Abstract
Objective Uyghur is an agglutinative language: words are formed, and their meanings expressed, by attaching affixes to stems. For example, in Uyghur "مەكتەپنى" refers to school, and the first person singular suffix "ىر" can be added to form a new word, "بەرسىڭىز", which means my school. When suffixes are added, certain morphological changes occur at the tail of the stem. Phonetic changes also occur in this process, such as weakening (some morphological changes occur at the tail of the stem), epenthesis (a few characters are added at the tail of the stem), and deletion (a few characters are deleted at the tail of the stem). Multiple phonetic changes can also appear simultaneously, causing further morphological change. All these morphological changes make the suffixed word differ from its stem. Consequently, current word spotting techniques can retrieve only a specific Uyghur word form; a stem cannot be used as the query to retrieve its corresponding suffixed words. In addition, traditional word spotting approaches consider only the number of matches in the matching set, not the spatial relationship among them, which limits the technique. This study proposes a mapping-relationship-based retrieval method for suffixed printed Uyghur words that exploits the spatial relationship of the matching set to retrieve the suffixed words corresponding to a stem. Method The proposed approach proceeds as follows. First, a segmentation algorithm splits the printed Uyghur document images into word images. Then, local features are extracted from each Uyghur word image. To compare the effectiveness of different local features, scale-invariant feature transform (SIFT) and speeded up robust features (SURF) are both used. 
The experimental results show that SIFT performs better than SURF because SIFT yields more feature points and distributes them more diversely, which helps the later retrieval steps; SURF, however, is faster. Next, Brute-Force matching and fast library for approximate nearest neighbors (FLANN) bidirectional feature matching are compared as matching algorithms. FLANN bidirectional matching performs better because it filters out more mismatched pairs than Brute-Force matching, and the correctness of the feature matching set is critical for the subsequent suffix word retrieval. Finally, the feature matching set is projected onto the candidate Uyghur word image by homography transformation and perspective transformation, which yields a quadrilateral. If this quadrilateral is a rectangle and the right part of the rectangle matches the outline of the query word image, the retrieved word is a suffixed form of the query stem. If the quadrilateral is not a rectangle and does not match the outline of the query word image, the retrieved word is not a suffixed form of the query stem but a mismatch. The proposed method can therefore both retrieve the suffixed words of a query stem and filter out mismatched words. In other words, the feature matching set is converted into a spatial relationship, and that relationship determines whether a retrieved word is a suffixed word or a mismatch. 
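The two decisions described above (keeping only bidirectionally consistent matches, then testing the projected quadrilateral) can be sketched in plain Python so that the example runs without an imaging library. This is an illustration under stated assumptions, not the authors' implementation: `bidirectional_matches` uses exhaustive nearest-neighbour search as a stand-in for FLANN, the homography `H` is assumed to have been estimated elsewhere from the matched points, and the function names, the tolerances `cos_tol` and `edge_tol`, and the reading of "the right part of the rectangle matches the outline" as "the projected right edge coincides with the right edge of the candidate word image" are all hypothetical choices.

```python
from math import hypot


def nearest(desc, pool):
    """Index of the descriptor in `pool` closest to `desc` (squared Euclidean)."""
    return min(range(len(pool)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(desc, pool[j])))


def bidirectional_matches(query_desc, target_desc):
    """Keep (i, j) only if j is i's nearest target AND i is j's nearest query.

    Exhaustive search stands in for FLANN here (assumption for illustration).
    """
    if not query_desc or not target_desc:
        return []
    forward = [nearest(d, target_desc) for d in query_desc]
    backward = [nearest(d, query_desc) for d in target_desc]
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]


def project(H, point):
    """Apply a 3x3 homography H (nested lists) to a 2-D point.

    Assumes the projective scale w is non-zero for the given point.
    """
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def is_near_rectangle(quad, cos_tol=0.1):
    """True when every corner angle of the quadrilateral is close to 90 degrees."""
    for k in range(4):
        prev_pt, here, next_pt = quad[k - 1], quad[k], quad[(k + 1) % 4]
        ax, ay = prev_pt[0] - here[0], prev_pt[1] - here[1]
        bx, by = next_pt[0] - here[0], next_pt[1] - here[1]
        norm = hypot(ax, ay) * hypot(bx, by)
        if norm == 0 or abs(ax * bx + ay * by) / norm > cos_tol:
            return False
    return True


def looks_like_suffixed_form(H, query_size, target_width, edge_tol=3.0):
    """Project the query outline into the candidate image and test its shape.

    Hypothetical decision rule: the projection must be rectangular and its
    right edge must sit at the right edge of the candidate word image.
    """
    qw, qh = query_size
    corners = [(0, 0), (qw, 0), (qw, qh), (0, qh)]
    quad = [project(H, c) for c in corners]
    if not is_near_rectangle(quad):
        return False
    right_edge = max(x for x, _ in quad)
    return abs(right_edge - target_width) <= edge_tol


# Toy data: a 60x30 query stem found shifted 40 px right in a 100 px wide word.
shift = [[1, 0, 40], [0, 1, 0], [0, 0, 1]]
shear = [[1, 0.8, 0], [0, 1, 0], [0, 0, 1]]
print(bidirectional_matches([(0, 0), (10, 10)], [(9, 9), (0, 1), (50, 50)]))
print(looks_like_suffixed_form(shift, (60, 30), 100))
print(looks_like_suffixed_form(shear, (60, 30), 100))
```

In the toy data the pure translation keeps the outline rectangular and right-aligned, so the candidate is accepted, while the sheared projection is rejected as a mismatch. In practice the descriptors would come from SIFT or SURF and the homography from a robust estimator.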
The spatial relationship of the feature matching set is thus used to find suffixed words, which implements suffix word retrieval for printed Uyghur text images. Result The experimental data consist of 17 648 segmented word images from 190 printed Uyghur text images; 30 word images, which have 167 corresponding suffixed words, serve as the query terms. Different local feature algorithms were used for suffix retrieval. The comparison shows that SIFT retrieves suffixed words better than SURF, with precision and recall of 94.23% and 88.02%, respectively. Comparative experiments were also carried out for different kinds of change: weakening, epenthesis, deletion, replacement, multiple phonetic changes appearing simultaneously, and changes in the tail of the word stem. Among these situations, retrieval was best for changes in the tail of the word stem, with precision and recall of 98.9% and 96.07%, respectively. The main reason is that changes in the tail of the stem do not alter the shape of the stem noticeably, which helps suffix word retrieval. By contrast, when multiple phonetic changes appear simultaneously, precision and recall are only 66.6% and 22.2%. The reason for this low performance is that multiple simultaneous phonetic changes alter the shape of the stem part of the suffixed word. In addition, several characters in the stem part are replaced by other characters, which makes it very difficult to retrieve the suffixed word from the original stem. However, this situation accounts for only a small share of all morphological changes. Conclusion The proposed algorithm therefore meets the different retrieval needs of users. 
The mapping-relationship-based method is also the first implementation of suffix word retrieval for Uyghur. Suffixed Uyghur word images are retrieved efficiently by using the spatial relationship between the matching sets.
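As a worked check on the headline figures in the Result part, the short sketch below computes precision and recall from raw counts. Only the 167 relevant suffixed word images come from the abstract; the counts of 156 retrieved and 147 correctly retrieved images are assumptions back-calculated to be consistent with the reported 94.23% and 88.02%, and the source does not state them.

```python
def precision_recall(true_positives, retrieved, relevant):
    """Return (precision, recall) from raw retrieval counts."""
    if retrieved <= 0 or relevant <= 0:
        raise ValueError("retrieved and relevant must be positive")
    return true_positives / retrieved, true_positives / relevant


RELEVANT = 167        # suffixed word images for the 30 queries (from the abstract)
RETRIEVED = 156       # ASSUMPTION: back-calculated, not stated in the source
TRUE_POSITIVES = 147  # ASSUMPTION: back-calculated, not stated in the source

precision, recall = precision_recall(TRUE_POSITIVES, RETRIEVED, RELEVANT)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")
# prints "precision = 94.23%, recall = 88.02%"
```

Under these assumed counts, 9 retrieved images would be mismatches and 20 suffixed forms would be missed, which illustrates why recall is the lower of the two figures.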
Keywords
