Efficient Transformer tracking for the edge end with linearly decomposed attention

Qiu Miaobo; Gao Jin; Lin Shubo; Li Liang; Wang Gang; Hu Weiming; Wang Yizheng

Published: 2024-07-16
DOI: 10.11834/jig.240192

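Qiu Miaobo¹, Gao Jin¹, Lin Shubo¹, Li Liang², Wang Gang², Hu Weiming¹, Wang Yizheng² (1. Institute of Automation, Chinese Academy of Sciences; 2. Institute of Military Cognition and Brain Sciences, Academy of Military Medical Sciences)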

Abstract

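Objective Migrating tracking algorithms designed for the server end to the edge end can significantly reduce power consumption and has high practical value. Transformer-based tracking algorithms currently show a clear performance advantage, but they may incur high latency when deployed at the edge end. To address this problem, a linearly decomposed attention (LinDA) structure oriented to the edge end is proposed, which effectively reduces the computation and inference latency of the Transformer. Method LinDA approximates multi-head attention as the sum of a data-dependent part and a data-independent part. The data-dependent part is expressed with simple element-wise multiplication and summation of vectors, which avoids costly transposition and matrix multiplication. The data-independent part directly uses a statistically obtained attention matrix, to which a learnable bias vector is added. This decomposition provides global attention while retaining the benefits of data dependence. To compensate for the accuracy loss caused by the linear decomposition, a knowledge distillation scheme is also designed. It adds two distillation losses to the original loss function: the first replaces the ground-truth bounding box with the bounding box predicted by the teacher model as the supervision target, called hard label knowledge distillation; the second uses the relative magnitudes of the teacher model's predicted scores as the supervision target, called relation matching knowledge distillation. Based on the LinDA structure, an edge-end object tracking algorithm, LinDATrack, is further implemented and deployed on the domestically produced edge computing host HS240. Result Evaluations were carried out on multiple public datasets. The experimental results show that on this computing host the algorithm reaches a tracking speed of about 62 FPS with a power consumption of about 79.5 W, only 6.2% of that of the server end, while its tracking accuracy (SUC) on LaSOT and LaSOT_ext drops by at most about 1.8% relative to the server-end baseline SwinTrack-T. Conclusion LinDATrack offers a good balance between speed and accuracy and has considerable advantages at the edge end.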

Keywords

object tracking, edge end, Transformer, multi-head attention, knowledge distillation

Efficient transformer tracking for the edge end with linearly decomposed attention

(Institute of Automation, Chinese Academy of Sciences)

Abstract

Objective Transferring tracking algorithms designed for the server end to the edge end has high practical value. It markedly reduces energy consumption, particularly in situations where resources are limited. In recent years, tracking algorithms built on the Transformer architecture have made considerable progress owing to their superior performance. Nonetheless, adapting these algorithms to edge computing is often difficult, primarily because of increased response time. This latency is attributed to the complexity of the Transformer's attention mechanism, which requires extensive computational resources. To address this issue, this paper proposes the Linearly Decomposed Attention (LinDA) module, designed expressly for edge computing. By lowering the computational demands and reducing the inference time of the Transformer, the LinDA module enables more efficient tracking at the edge end. Method LinDA approximates the multi-head attention mechanism as the sum of two components: a data-dependent component and a data-independent component. For the data-dependent component, LinDA adopts a computationally economical approach. Rather than relying on the traditional, resource-intensive transposition and matrix multiplication, it employs direct element-wise multiplication and summation of vectors. This markedly reduces computational complexity, which suits edge computing environments where resources are scarce. For the data-independent component, LinDA uses a statistically derived attention matrix that captures global context. A learnable bias vector is then added to this matrix, improving the model's adaptability.
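The abstract does not give the exact LinDA formula, so the following is only a minimal sketch of one way the described decomposition could look for a single head. The function name `linda_like_attention`, the way keys and values are summarized, and the uniform placeholder for the statistical attention matrix are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def linda_like_attention(q, k, v, attn_stat, bias):
    """Hypothetical single-head sketch of a linearly decomposed attention.

    q, k, v   : (N, d) token features
    attn_stat : (N, N) fixed attention matrix (assumed precomputed statistics)
    bias      : (d,) learnable bias vector
    """
    # Data-independent part: a fixed attention matrix mixes the values,
    # then a learnable bias vector is added (broadcast over tokens).
    independent = attn_stat @ v + bias

    # Data-dependent part (assumed form): summarize keys and values with an
    # element-wise product averaged over tokens, then modulate each query
    # element-wise. No transpose and no N x N score matrix are computed.
    summary = (k * v).mean(axis=0)   # shape (d,)
    dependent = q * summary          # shape (N, d)

    return dependent + independent


rng = np.random.default_rng(0)
N, d = 6, 4
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
attn_stat = np.full((N, N), 1.0 / N)  # placeholder: uniform attention
bias = np.zeros(d)
out = linda_like_attention(q, k, v, attn_stat, bias)
print(out.shape)  # (6, 4)
```

In this sketch the data-dependent term costs O(N·d) rather than the O(N²·d) of a full query-key product, which is the kind of saving the text attributes to avoiding transposition and matrix multiplication.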
This decomposition allows LinDA to retain good precision while significantly improving efficiency on resource-constrained devices. To mitigate the accuracy loss caused by the linear decomposition, this paper presents a knowledge distillation strategy that strengthens the student model. The strategy adds two distillation losses to the baseline loss function, each designed to transfer knowledge from the teacher model to the student model. First, hard label knowledge distillation replaces the ground-truth bounding box with the bounding box predicted by the teacher model as the supervision target for the student model. This lets the student model learn directly from the teacher's predictions and improves its predictive precision. Second, relation matching knowledge distillation uses the relationship among the teacher model's predictions, namely the relative magnitudes of its predicted scores, as the supervision target. Embedding this relational knowledge into the student model during training further improves its performance and robustness. In summary, this knowledge distillation framework transfers the teacher model's knowledge to the student model and counteracts the precision degradation associated with the linear decomposition.
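The two distillation terms can be illustrated with a short sketch. The abstract does not specify the loss forms, so the L1 box loss, the pairwise logistic ranking loss, the weights `w_hard` and `w_rel`, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np


def hard_label_kd_loss(student_box, teacher_box):
    """L1 box loss with the teacher's predicted box as target (assumed form)."""
    diff = np.asarray(student_box, dtype=float) - np.asarray(teacher_box, dtype=float)
    return float(np.abs(diff).mean())


def relation_matching_kd_loss(student_scores, teacher_scores):
    """Pairwise ranking loss: the student should order scores as the teacher does."""
    s = np.asarray(student_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    sign = np.sign(t[:, None] - t[None, :])  # +1 where the teacher ranks i above j
    gap = s[:, None] - s[None, :]            # student score gaps for the same pairs
    mask = sign != 0                         # skip ties and the diagonal
    if not mask.any():
        return 0.0
    # softplus(-sign * gap) is small when the student agrees with the teacher's order
    return float(np.logaddexp(0.0, -sign[mask] * gap[mask]).mean())


def total_loss(base_loss, student_box, teacher_box,
               student_scores, teacher_scores, w_hard=1.0, w_rel=1.0):
    """Original loss plus the two distillation terms (weights are hypothetical)."""
    return (base_loss
            + w_hard * hard_label_kd_loss(student_box, teacher_box)
            + w_rel * relation_matching_kd_loss(student_scores, teacher_scores))


teacher_scores = [0.9, 0.5, 0.1]
agree = relation_matching_kd_loss([2.0, 1.0, 0.0], teacher_scores)
disagree = relation_matching_kd_loss([0.0, 1.0, 2.0], teacher_scores)
print(agree < disagree)  # True
```

The ranking term depends only on the order of the teacher's scores, not their absolute values, which matches the idea of supervising with relative magnitudes.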
This paper further implements an edge-end-oriented object tracking algorithm, LinDATrack, based on LinDA and the distillation strategy, and deploys it on the domestic edge computing host HS240. Result We evaluate the system on several public datasets. Running on this computing host, LinDATrack achieves a tracking speed of around 62 frames per second (FPS), which supports real-time tracking. The system operates with a power consumption of approximately 79.5 watts, which is just 6.2% of that of the server end. This reduction makes it well suited to deployment in settings with limited resources. When compared with the server-end baseline algorithm SwinTrack-T, the system's tracking accuracy on LaSOT and LaSOT_ext, measured by the Success Rate (SUC) metric, decreases by at most about 1.8%. The system therefore maintains precise tracking while reducing resource usage.
Conclusion LinDATrack achieves a good balance of speed and accuracy. It supports real-time tracking and shows considerable advantages when deployed at the edge end, making it well suited to environments with limited resources.
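The reported speed and power figures can be related with simple arithmetic. The sketch below uses only the approximate numbers quoted in the abstract. The derived server-end power and per-frame energy are back-of-the-envelope estimates, not values reported by the authors, and they assume the 6.2% ratio refers to average power draw.

```python
edge_power_w = 79.5   # reported edge-end power consumption (approximate)
edge_share = 0.062    # edge power as a fraction of server-end power
fps = 62.0            # reported tracking speed (approximate)

# Implied server-end power, if 79.5 W is 6.2% of it.
server_power_w = edge_power_w / edge_share

# Energy spent per tracked frame on the edge host, and time budget per frame.
energy_per_frame_j = edge_power_w / fps
frame_time_ms = 1000.0 / fps

print(f"implied server power: {server_power_w:.0f} W")
print(f"edge energy per frame: {energy_per_frame_j:.2f} J")
print(f"time per frame: {frame_time_ms:.1f} ms")
```

Under these assumptions the server-end configuration would draw roughly 1.3 kW, and the edge host would spend a little over one joule per tracked frame.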

Keywords

object tracking, edge end, transformer, multi-head attention, knowledge distillation
