User-customized intention understanding agent for planar scanned images

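Feng Yike1, Li Xuewei2, Liu Pengwei3, Guo Fengjun3, Long Teng3, Li Xi2 (1. School of Software Technology, Zhejiang University; 2. College of Computer Science and Technology, Zhejiang University; 3. Shanghai INTSIG Information Co., Ltd.)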

Abstract
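Objective Understanding user intention for planar scanned images is a common practical requirement in mobile applications. Traditional methods mainly model large amounts of historical user behavior data to predict a user's likely intention for a new image, but this application scenario faces challenges such as customization and few interactions, which limit the effectiveness of traditional methods. Agent methods that have emerged in recent years can cope with these challenges well and offer a new approach to the customized intention understanding task. On this basis, this paper proposes a user-customized intention understanding agent for planar scanned images. Method The agent consists of task perception, task planning, task execution, and feedback correction modules. To address the technical challenges the method faces, namely the few-shot incremental learning problem, limited computing resources, and insufficient benchmark datasets, a "divide and conquer" domain generalization method is first proposed, which decouples inference for base tasks from inference for customized tasks so that they do not affect each other. Second, intention understanding is performed through template matching, so that new customized tasks can be handled without fine-tuning. Third, a self-improvement strategy reduces the noise in intention understanding results and improves the reliability of domain generalization. In addition, a benchmark dataset for customized intention understanding on planar scanned images is constructed. Result The agent is compared with seven other methods on the proposed benchmark dataset. On the mean intersection over union (mIoU) metric, the agent reaches 90.47%, which is 15.60% higher than the second-best method, and its total accuracy is 22.10% higher. Ablation experiments verify the effectiveness of each part of the agent. Finally, the agent is applied to the public receipt dataset CORD (COnsolidated Receipt Dataset) to verify its generalization ability. Conclusion The proposed agent outperforms cutting-edge detection and segmentation models on the customized intention understanding task for planar scanned images while avoiding the need to fine-tune a model for each subtask, which shows that the method is both effective and efficient.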
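The mIoU figure quoted in the abstract can be illustrated with a minimal sketch. It assumes that each intention understanding result is an axis-aligned box `(x_min, y_min, x_max, y_max)` and that mIoU is the plain average of per-image IoU values; both are assumptions made for illustration, since the abstract does not state whether the metric is computed over boxes or masks, or how it is averaged. The names `box_iou` and `mean_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed region format: (x_min, y_min, x_max, y_max). The paper may use masks instead.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth regions."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

For example, a perfect prediction paired with a prediction that overlaps its target by one unit square out of a union of seven gives a mean of (1 + 1/7) / 2 = 4/7.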
Keywords
User-customized intention understanding agent for planar scanned images

(1. School of Software Technology, Zhejiang University; 2. College of Computer Science and Technology, Zhejiang University; 3. Shanghai INTSIG Information Co., Ltd.)

Abstract
Objective In the era of mobile internet, mobile applications are developing rapidly and have become an indispensable part of daily life. Users’ demands on and expectations of mobile applications are also rising steadily. In mobile application development, user intention understanding is an important research field that aims to provide more personalized and intelligent services by analyzing users’ behavior and needs. User intention understanding for image input is a common practical requirement in mobile applications. In image-related human-computer interaction, users often interact with mobile applications through taps or gestures, and the application is expected to infer their intentions from these inputs. Traditional intention understanding methods mainly model a large amount of historical user behavior data to predict the likely user intention for a new image. However, this application scenario poses challenges such as customization and few interactions, which limit the effectiveness of traditional methods. In recent years, with the development of autonomous artificial intelligence, agent technology has offered a new perspective on the user-customized intention understanding task. Agents can imitate and learn human thinking processes, understand users accurately, reduce the burden of memory and operation, and help mobile applications better understand users’ intentions and needs. Therefore, we propose a user-customized intention understanding agent for planar scanned images. Method The user-customized intention understanding agent consists of task perception, task planning, task execution, and feedback correction modules.
The task perception module extracts information from an input image and combines it with user-customized template information retrieved from the stored intention libraries to understand the intention behind the input image. The task planning module uses the results of the perception module to plan and make decisions for the input image. The task execution module follows the decision made by the planning module, executes the corresponding action, and outputs the intention understanding result for the input image. If the user is not satisfied with the current result, the user can provide feedback through the feedback correction module; the agent then outputs a new result and adds the corrected image to the intention libraries as a new user-customized intention template image. To address the technical challenges faced in building these modules, such as the few-shot incremental learning problem, limited computing resources, and insufficient benchmark datasets, we propose several new techniques. First, a “divide and conquer” domain generalization method decouples inference for basic tasks from inference for customized tasks so that they do not affect each other. Second, we perform intention understanding through template matching to handle new customized samples without fine-tuning, which addresses the few-shot incremental learning problem. Third, a self-improvement strategy reduces the noise in intention understanding results and improves the reliability of domain generalization. In addition, we construct a customized intention understanding benchmark for planar scanned images, providing a data foundation for this study and subsequent research. Result We conduct a series of ablation experiments on our dataset to show the effectiveness of the different parts of our agent. We also compare our agent with seven state-of-the-art methods on our dataset, including traditional saliency detection models, model fine-tuning, and visual-language models.
The mean intersection over union (mIoU) of the agent on our dataset reaches 90.47%, which is 15.60% higher than that of the second-best method, and the total accuracy is improved by 22.10%. Finally, the agent is applied to the public receipt dataset CORD (COnsolidated Receipt Dataset) to verify its generalization ability. Conclusion In this study, we propose a user-customized intention understanding agent for planar scanned images. The experimental results demonstrate that the proposed “divide and conquer” domain generalization method, the self-improvement strategy, and the object detection parameters each contribute to intention understanding. In addition, the agent outperforms cutting-edge detection and segmentation models on the user intention understanding task without fine-tuning, demonstrating the effectiveness and efficiency of our method.
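The interplay of template matching, the "divide and conquer" split, and feedback correction described in the Method part can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: images are assumed to be already reduced to feature vectors, matching is assumed to use cosine similarity against stored templates, and the names `IntentionLibrary`, `understand`, `correct`, the `0.9` threshold, and the intention labels are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


@dataclass
class IntentionLibrary:
    """Hypothetical store of user-customized templates: (feature, intention) pairs."""
    templates: List[Tuple[List[float], str]] = field(default_factory=list)

    def add(self, feature: Sequence[float], intention: str) -> None:
        self.templates.append((list(feature), intention))

    def best_match(self, feature: Sequence[float]) -> Tuple[Optional[str], float]:
        best_intention, best_score = None, -1.0
        for template_feature, intention in self.templates:
            score = cosine(feature, template_feature)
            if score > best_score:
                best_intention, best_score = intention, score
        return best_intention, best_score


def understand(feature: Sequence[float], library: IntentionLibrary,
               base_intention: str = "base_task", threshold: float = 0.9) -> str:
    """Use a customized template if one matches well; otherwise take the base branch."""
    intention, score = library.best_match(feature)
    if intention is not None and score >= threshold:
        return intention      # customized branch, decoupled from the base task
    return base_intention     # base branch, unaffected by stored templates


def correct(feature: Sequence[float], corrected_intention: str,
            library: IntentionLibrary) -> None:
    """Feedback correction: store the corrected sample as a new template, no fine-tuning."""
    library.add(feature, corrected_intention)


if __name__ == "__main__":
    library = IntentionLibrary()
    receipt = [0.9, 0.1, 0.0]                       # made-up feature vector
    print(understand(receipt, library))             # no templates yet: base branch
    correct(receipt, "extract_total_amount", library)
    print(understand([0.88, 0.12, 0.01], library))  # similar image: customized branch
```

In this sketch a new customized task costs one `add` call rather than a training run, which is the property the abstract attributes to template matching; the threshold decides when an image is treated as a customized case rather than a base one.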
Keywords
