  • Published: 2024-09-04
Recommended by ChinaMM2024: Complementary Multi-type Prompts for Weakly Supervised Temporal Action Localization

Ren Xiaolong1, Zhang Feifei1, Zhou Wanting2, Zhou Ling3 (1. Tianjin University of Technology; 2. Beijing University of Posts and Telecommunications; 3. Macau University of Science and Technology)

Abstract
Objective  Weakly supervised temporal action localization locates the start and end times of action instances and identifies their categories using only video-level annotations. Current vision-language-based methods use textual prompts to improve the performance and robustness of temporal action localization models. In vision-language models, action label text is usually encapsulated as textual prompts, which can be divided by type into handcrafted prompts and learnable prompts. Different types of textual prompts capture different knowledge about action categories, yet existing methods ignore the complementarity between the two, so the introduced prompts cannot fully play their guiding role and bring in a certain amount of noise, which leads to inaccurate action boundaries. To address this, this paper proposes a multi-type prompts complementary model for weakly supervised temporal action localization. Method  First, a prompt interaction module is designed in which each type of textual prompt interacts with the video separately and is weighted by attention, producing feature information at different scales. Second, to model the correspondence between text and video, a segment-level contrastive loss is used to constrain the matching between textual prompts and action segments. Finally, a threshold filtering module is designed to filter and compare the scores of multiple class activation sequences (CAS), raising the scores of correct categories, lowering those of incorrect categories, and enhancing the discriminability of action categories. Result  Extensive experiments on three representative datasets, THUMOS-14, ActivityNet-1.2, and ActivityNet-1.3, verify the effectiveness of the proposed method. On THUMOS-14, the method achieves an average mAP of 39.1% over IoU thresholds 0.1:0.7, an average improvement of 1.1% over the 2023 P-MIL (proposal-based multiple instance learning) method. On ActivityNet-1.2, it achieves 27.3% mAP over 0.5:0.95, an average improvement of nearly 1% over state-of-the-art methods. On ActivityNet-1.3, it achieves performance comparable to concurrent work, with an average mAP of 26.7% over 0.5:0.95. Conclusion  The proposed temporal action localization model uses the complementarity of the two types of textual prompts to guide localization, and the proposed threshold filtering module makes the most of the strengths of both prompt types and maximizes their auxiliary role, yielding more accurate localization results.
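The distinction between the two prompt types can be illustrated with a short, hypothetical Python sketch (assuming a CLIP-style text encoder; the template, class names, and dimensions below are illustrative and not the paper's actual configuration):

import torch
import torch.nn as nn

class_names = ["high jump", "long jump", "diving"]   # example action labels (hypothetical)
embed_dim, ctx_len = 512, 8                          # assumed feature size / context length

# Handcrafted prompts: a fixed template wrapped around each action label,
# e.g., "a video of {class}", later encoded by a frozen text encoder.
handcrafted_prompts = [f"a video of {c}" for c in class_names]

# Learnable prompts: a set of context vectors shared across classes and
# optimized during training, concatenated with a per-class token embedding.
class LearnablePrompts(nn.Module):
    def __init__(self, num_classes, ctx_len, embed_dim):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)               # learnable context
        self.cls_embed = nn.Parameter(torch.randn(num_classes, 1, embed_dim) * 0.02)  # class tokens

    def forward(self):
        # (num_classes, ctx_len + 1, embed_dim): [learnable context | class token]
        ctx = self.ctx.unsqueeze(0).expand(self.cls_embed.size(0), -1, -1)
        return torch.cat([ctx, self.cls_embed], dim=1)

learnable = LearnablePrompts(len(class_names), ctx_len, embed_dim)
print(len(handcrafted_prompts), learnable().shape)   # 3 torch.Size([3, 9, 512])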
Keywords
Complementary Multi-type Prompts for Weakly Supervised Temporal Action Localization

Ren Xiaolong1, Zhang Feifei1, Zhou Wanting2, Zhou Ling3 (1. Tianjin University of Technology; 2. Beijing University of Posts and Telecommunications; 3. Macau University of Science and Technology)

Abstract
Objective  Weakly supervised temporal action localization uses only video-level annotations to locate the start and end times of action instances and to identify their categories. Because only video-level annotations are available, a loss function cannot be designed directly for the localization task, so existing work generally adopts a "localization by classification" strategy and trains with multiple instance learning. However, this process has limitations: 1) localization and classification are two different tasks with an obvious gap between them, so localizing from classification results may hurt the final performance; 2) in the weakly supervised setting, fine-grained supervision for distinguishing actions from background in videos is lacking, which poses a great challenge for localization. Recently, vision-language models have received extensive attention. These models aim to model the correspondence between images and texts to achieve more comprehensive visual perception. To better apply such large models to downstream tasks, carefully designed textual prompts can improve model performance and robustness. Current vision-language-based approaches use auxiliary textual prompts to compensate for the missing supervision and thereby improve the performance and robustness of temporal action localization models. In vision-language models, action label text is usually encapsulated as textual prompts, which can be categorized into handcrafted prompts and learnable prompts. Handcrafted prompts consist of a fixed template and an action label, e.g., "a video of {class}"; they capture more general knowledge of an action class but lack specific knowledge of the relevant action. Learnable prompts consist of a set of learnable vectors that are adjusted and optimized during training, so they can capture more specific knowledge. The two types of textual prompts can therefore complement each other and improve the ability to discriminate between categories. However, existing methods ignore this complementarity, so the introduced prompts cannot fully play their guiding role and bring in noise, which leads to inaccurate action boundaries. To address this, this paper proposes a multi-type prompts complementary model for weakly supervised temporal action localization, which maximizes the guiding effect of textual prompts to improve the accuracy of action localization. Method  First, starting from the complementarity of textual prompts, this paper designs a prompt interaction module that matches handcrafted prompts and learnable prompts with action segments to obtain different similarity matrices, then mines the intrinsic connection between textual features and segment features through an attention layer and fuses them, realizing the interaction between different types of textual prompts. At the same time, textual prompts need to be well matched with action segments in order to play their guiding role. To this end, this paper introduces a segment-level contrastive loss to constrain the matching between prompts and action segments.
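As an illustration of the two ideas just described, the following sketch shows how a prompt interaction module and a segment-level contrastive loss might be implemented; the tensor shapes, the multi-head attention fusion, and the InfoNCE-style loss are assumptions for exposition rather than the paper's exact design:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptInteraction(nn.Module):
    """Match each prompt type with video segments, then fuse with attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, seg_feat, hand_feat, learn_feat):
        # seg_feat: (T, D) segment features; hand_feat / learn_feat: (C, D) prompt features
        sim_hand  = seg_feat @ hand_feat.t()     # (T, C) similarity matrix, handcrafted prompts
        sim_learn = seg_feat @ learn_feat.t()    # (T, C) similarity matrix, learnable prompts
        # attention between segment features (queries) and the two prompt sets (keys/values)
        prompts = torch.cat([hand_feat, learn_feat], dim=0).unsqueeze(0)   # (1, 2C, D)
        fused, _ = self.attn(seg_feat.unsqueeze(0), prompts, prompts)      # (1, T, D)
        return fused.squeeze(0), sim_hand, sim_learn

def segment_contrastive_loss(seg_feat, prompt_feat, labels, tau=0.07):
    """InfoNCE-style loss pulling each segment toward its action's prompt.
    seg_feat: (T, D), prompt_feat: (C, D), labels: (T,) pseudo class index per segment."""
    logits = F.normalize(seg_feat, dim=-1) @ F.normalize(prompt_feat, dim=-1).t() / tau
    return F.cross_entropy(logits, labels)

# toy usage with random features (T segments, C classes, D channels)
T, C, D = 50, 20, 256
seg, hand, learn = torch.randn(T, D), torch.randn(C, D), torch.randn(C, D)
module = PromptInteraction(D)
fused, sim_h, sim_l = module(seg, hand, learn)
loss = segment_contrastive_loss(fused, learn, torch.randint(0, C, (T,)))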
Finally, this paper designs a threshold filtering module that filters the class activation sequences (CAS) obtained under the guidance of the different types of textual prompts according to a threshold. The final CAS is then obtained by summing, with a specific scale parameter, the CAS produced by this multi-layer filtering and the CAS computed from video-level features alone; it therefore keeps the parts of each sequence with higher confidence scores and realizes the complementary advantages of the different types of textual prompts. Result  Extensive experiments on three representative datasets, THUMOS-14, ActivityNet-1.2, and ActivityNet-1.3, validate the effectiveness of the proposed method. On THUMOS-14, the method achieves average performance of 58.2%, 39.1%, and 47.9% for mAP(0.1:0.5), mAP(0.1:0.7), and mAP(0.3:0.7), respectively, surpassing the 2023 P-MIL (proposal-based multiple instance learning) method by up to 1.1% on average. On ActivityNet-1.2, the proposed method achieves 27.3% at mAP(0.5:0.95), an average improvement of nearly 1% over the state-of-the-art method. On ActivityNet-1.3, it achieves performance comparable to concurrent work, with an average mAP of 26.7% at mAP(0.5:0.95). In addition, extensive ablation experiments on THUMOS-14 confirm the effectiveness of each module. Conclusion  This paper proposes a new weakly supervised temporal action localization model based on the complementarity of multiple types of prompts, which alleviates inaccurate localization boundaries by leveraging the complementarity between different types of textual prompts. A prompt interaction module is designed specifically to realize the interaction between different types of textual prompts. In addition, a segment-level contrastive loss is introduced to model the correspondence between textual prompts and video, better constraining the matching between the introduced prompts and action segments. Finally, a multi-layer nested threshold filtering module is designed, which makes full use of the advantages of the two types of textual prompts, avoids interference from noisy information, and maximizes the auxiliary role of textual information. Extensive experiments on challenging benchmark datasets demonstrate the effectiveness of the proposed method.
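A minimal sketch of the threshold filtering and fusion step, under assumed choices (a single fixed threshold, an element-wise maximum over the two prompt-guided CAS, and one scale parameter), might look as follows:

import torch

def threshold_filter(cas, thresh=0.5):
    """Keep only scores above the threshold; suppress the rest to zero."""
    return torch.where(cas > thresh, cas, torch.zeros_like(cas))

def fuse_cas(cas_video, cas_hand, cas_learn, scale=0.5, thresh=0.5):
    """Fuse the CAS guided by each prompt type with the video-only CAS.
    cas_*: (T, C) class activation sequences."""
    # filtering step: keep high-confidence parts of each prompt-guided CAS
    filtered = threshold_filter(torch.maximum(cas_hand, cas_learn), thresh)
    # weighted sum with the CAS computed from video-level features only
    return cas_video + scale * filtered

# toy usage with random activation sequences (T snippets, C classes)
T, C = 50, 20
cas_v, cas_h, cas_l = torch.rand(T, C), torch.rand(T, C), torch.rand(T, C)
final_cas = fuse_cas(cas_v, cas_h, cas_l)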
Keywords
