LIU Xin, ZENG Kui, CHEN Feng. Transformer action recognition model based on multi-query token selection mechanism[J]. CAAI Transactions on Intelligent Systems, 2026, 21(2): 410-422. [doi:10.11992/tis.202503002]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 21
Issue: 2026, No. 2
Pages: 410-422
Section: Academic Papers - Machine Perception and Pattern Recognition
Publication date: 2026-03-05
Title: Transformer action recognition model based on multi-query token selection mechanism
Author(s): LIU Xin, ZENG Kui, CHEN Feng
School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Keywords: action recognition; attention mechanism; feature fusion; temporal attention; spatial attention; dynamic features; local features; spatiotemporal feature extraction module
CLC number: TP391.41
DOI: 10.11992/tis.202503002
Abstract:
To address the limitations of vision Transformers (ViTs) in video action recognition, specifically the inability of spatial attention to focus on shallow local features and the inaccuracy of temporal attention in capturing dynamic information, we propose a Transformer-based action recognition model incorporating a multi-query token selection mechanism. The model introduces a local feature aggregation module composed of multiple spatiotemporal attention blocks, each of which employs 3D convolutions combined with channel and spatial attention to sharpen the focus on shallow local features. It further constructs a global spatiotemporal feature extraction module consisting of several spatiotemporal processing units, each comprising (1) a hybrid spatial perception module, which applies 3D depthwise separable convolutions before the global spatial attention mechanism to strengthen attention to local spatiotemporal neighborhoods; (2) a temporal attention module with multi-query token selection, which filters the tokens of each frame to suppress background noise and emphasize human actions; and (3) a spatiotemporal feature fusion module, which integrates spatial and temporal features efficiently through sequential fusion. Experiments on multiple datasets show that the proposed method outperforms the baseline models.
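The core idea of the multi-query token selection described in the abstract, scoring each frame's tokens against several query tokens and keeping only the best-matching ones before temporal attention, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the shapes, the dot-product scoring, and the max-over-queries aggregation rule are all assumptions made for illustration.

```python
import numpy as np

def multi_query_token_select(x, queries, k):
    """Hypothetical sketch of multi-query token selection.

    Scores every token of every frame against a set of query tokens and
    keeps the top-k tokens per frame, discarding likely-background tokens.

    x:       (T, N, C) array of N patch tokens per frame, T frames
    queries: (M, C) array of M query tokens
    k:       number of tokens to keep per frame
    returns: (T, k, C) array of selected tokens
    """
    scores = x @ queries.T                         # (T, N, M) token-query affinities
    relevance = scores.max(axis=-1)                # (T, N): best match over all queries
    topk = np.argsort(relevance, axis=-1)[:, -k:]  # indices of top-k tokens per frame
    return np.take_along_axis(x, topk[..., None], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 196, 64))   # 8 frames, 196 patch tokens, dim 64
q = rng.standard_normal((4, 64))        # 4 query tokens (learned in the real model)
selected = multi_query_token_select(x, q, k=49)
print(selected.shape)                   # (8, 49, 64)
```

In the model itself the queries would be learned parameters and the selection would feed the temporal attention module; here random tensors merely demonstrate the shape bookkeeping of per-frame top-k filtering.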
Memo
Received: 2025-03-03.
Funding: Chongqing Support Program for the Entrepreneurship and Innovation of Returned Overseas Scholars (CX2024086); Chengdu Key R&D Support Plan, Regional Science and Technology Innovation Cooperation Project (2023-YF11-00015-HZ).
About the authors: LIU Xin, associate professor, Ph.D., whose research covers machine learning, data analysis, action recognition, and image processing, e-mail liuxin@cqupt.edu.cn; ZENG Kui, master's student, whose research focuses on video action recognition, e-mail a13350326994@163.com; CHEN Feng, lecturer, Ph.D., whose research focuses on intelligent data analysis, e-mail chenfeng@cqupt.edu.cn.
Corresponding author: CHEN Feng. E-mail: chenfeng@cqupt.edu.cn