LIU Xin, ZENG Kui, CHEN Feng. Transformer action recognition model based on multi-query token selection mechanism[J]. CAAI Transactions on Intelligent Systems, 2026, 21(2): 410-422. [doi:10.11992/tis.202503002]

Transformer action recognition model based on a multi-query token selection mechanism

References:
[1] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2020–10–22)[2025–03–03]. https://arxiv.org/abs/2010.11929.
[2] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10012–10022.
[3] FAN Haoqi, XIONG Bo, MANGALAM K, et al. Multiscale vision transformers[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 6824–6835.
[4] ZHANG D J, LI Kunchang, CHEN Yunpeng, et al. MorphMLP: a self-attention-free, MLP-like backbone for image and video [EB/OL]. (2021–11–24)[2025–03–03]. https://arxiv.org/abs/2111.12527v1.
[5] FEICHTENHOFER C, FAN Haoqi, MALIK J, et al. SlowFast networks for video recognition[C]//2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 6201–6210.
[6] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[7] HOWARD A G, ZHU Menglong, CHEN Bo, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017–04–17)[2025–03–03]. https://arxiv.org/abs/1704.04861.
[8] LI Xianhang, WANG Yali, ZHOU Zhipeng, et al. SmallBigNet: integrating core and contextual views for video classification[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 1089–1098.
[9] ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 6816–6826.
[10] BERTASIUS G, WANG Heng, TORRESANI L. Is space-time attention all you need for video understanding? [EB/OL]. (2021–02–09)[2025–03–03]. https://arxiv.org/abs/2102.05095.
[11] LIU Ze, NING Jia, CAO Yue, et al. Video swin transformer[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 3192–3201.
[12] LI Kunchang, WANG Yali, GAO Peng, et al. UniFormer: unified Transformer for efficient spatiotemporal representation learning[EB/OL]. (2022–01–12)[2025–03–03]. https://arxiv.org/abs/2201.04676.
[13] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[J]. Advances in neural information processing systems, 2014, 27: 568-576.
[14] HORN B K P, SCHUNCK B G. Determining optical flow[J]. Artificial intelligence, 1981, 17(1-3): 185-203.
[15] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1933–1941.
[16] FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal residual networks for video action recognition[EB/OL]. (2016–11–07)[2025–03–03]. https://arxiv.org/abs/1611.02155.
[17] WANG Limin, XIONG Yuanjun, WANG Zhe, et al. Temporal segment networks: towards good practices for deep action recognition[C]//Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016: 20–36.
[18] ZHOU Bolei, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]//Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018: 831–846.
[19] LIN Ji, GAN Chuang, HAN Song. TSM: temporal shift module for efficient video understanding[C]//2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 7082–7092.
[20] JI Shuiwang, XU Wei, YANG Ming, et al. 3D convolutional neural networks for human action recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(1): 221-231.
[21] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 4489–4497.
[22] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 4724–4733.
[23] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6450–6459.
[24] QIU Zhaofan, YAO Ting, MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]//2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5534–5542.
[25] FEICHTENHOFER C. X3D: expanding architectures for efficient video recognition[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 200–210.
[26] TIAN Feng, WEI Ningbin, LIU Fang, et al. Oilfield operation behavior recognition based on spatio-temporal and action adaptive fusion network[J]. CAAI transactions on intelligent systems, 2024, 19(6): 1407-1418.
[27] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30: 5998-6008.
[28] CHEN Zhuochao. Research and system implementation of behavior recognition based on Transformer model[D]. Beijing: Beijing University of Posts and Telecommunications, 2024.
[29] PATRICK M, CAMPBELL D, ASANO Y M, et al. Keeping your eye on the ball: trajectory attention in video Transformers[EB/OL]. (2021–06–09)[2025–03–03]. https://arxiv.org/abs/2106.05392.
[30] LI Yanghao, WU Chaoyuan, FAN Haoqi, et al. MViTv2: improved multiscale vision transformers for classification and detection[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 4804–4814.
[31] LI Kunchang, WANG Yali, HE Yinan, et al. UniFormerV2: spatiotemporal learning by arming image ViTs with video UniFormer[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 1632–1643.
[32] LOU Meng, ZHANG Shu, ZHOU Hongyu, et al. TransXNet: learning both global and local dynamics with a dual dynamic token mixer for visual recognition[EB/OL]. (2023–10–30)[2025–03–03]. https://arxiv.org/abs/2310.19380.
[33] CHOLLET F. Xception: deep learning with depthwise separable convolutions[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1800–1807.
[34] HU Jie, SHEN Li, SUN Gang. Squeeze-and-excitation networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7132–7141.
[35] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018: 3–19.
[36] ZHENG J, REZAGHOLIZADEH M, PASSBAN P. Dynamic position encoding for Transformers[EB/OL]. (2022–04–18)[2025–03–03]. https://arxiv.org/abs/2204.08142.
[37] DONG Xiaoyi, BAO Jianmin, CHEN Dongdong, et al. CSWin transformer: a general vision transformer backbone with cross-shaped windows[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 12124–12134.
[38] GOYAL R, KAHOU S E, MICHALSKI V, et al. The “something something” video database for learning and evaluating visual common sense[C]//2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5843–5851.
[39] LOSHCHILOV I, HUTTER F. SGDR: stochastic gradient descent with warm restarts[C]//Proceedings of the 5th International Conference on Learning Representations. Toulon: OpenReview.net, 2017: 1–16.
[40] QIU Zhaofan, YAO Ting, NGO C W, et al. Learning spatio-temporal representation with local and global diffusion[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 12056–12065.
[41] TRAN D, WANG Heng, FEISZLI M, et al. Video classification with channel-separated convolutional networks[C]//2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 352–361.
[42] KONDRATYUK D, YUAN Liangzhe, LI Yandong, et al. MoViNets: mobile video networks for efficient video recognition[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 16020–16030.
[43] NEIMARK D, BAR O, ZOHAR M, et al. Video transformer network[C]//2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal: IEEE, 2021: 3163–3172.
[44] SRINIVAS A, LIN T Y, PARMAR N, et al. Bottleneck transformers for visual recognition[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 16514–16524.
[45] BULAT A, PEREZ RUA J M, SUDHAKARAN S, et al. Space-time mixing attention for video Transformer[J]. Advances in neural information processing systems, 2021, 34: 19594-19607.
[46] LI Kunchang, LI Xinhao, WANG Yi, et al. VideoMamba: state space model for efficient video understanding[C]//Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2024: 237–255.
[47] LI Yan, JI Bin, SHI Xintian, et al. TEA: temporal excitation and aggregation for action recognition[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 909–918.
[48] LI Kunchang, LI Xianhang, WANG Yali, et al. CT-Net: channel tensorization network for video classification[EB/OL]. (2021–06–03)[2025–03–03]. https://arxiv.org/abs/2106.01603.
[49] WANG Limin, TONG Zhan, JI Bin, et al. TDN: temporal difference networks for efficient action recognition[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 1895–1904.
Similar Articles:
[1] MEI Xue, HU Shi, XU Songsong, et al. Multi scale feature based double layer HMM and its application in behavior recognition[J]. CAAI Transactions on Intelligent Systems, 2012, 7(6): 512.
[2] SHEN Kai, WANG Xiaofeng, YANG Yadong. Salient object detection based on bidirectional message link convolution neural network[J]. CAAI Transactions on Intelligent Systems, 2019, 14(6): 1152. [doi:10.11992/tis.201812003]
[3] ZHAO Wenqing, CHENG Xingfu, ZHAO Zhenbing, et al. Insulator recognition based on attention mechanism and Faster RCNN[J]. CAAI Transactions on Intelligent Systems, 2020, 15(1): 92. [doi:10.11992/tis.201907023]
[4] SHEN Xiangxiang, HOU Xinwen, YIN Chuanhuan. State attention in deep reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2020, 15(2): 317. [doi:10.11992/tis.201809033]
[5] ZENG Biqing, HAN Xuli, WANG Shengyu, et al. Hierarchical double-attention neural networks for sentiment classification[J]. CAAI Transactions on Intelligent Systems, 2020, 15(3): 460. [doi:10.11992/tis.201812017]
[6] JI Xiaofei, XIE Xuan, REN Yan. Human interaction recognition and prediction algorithm based on deep learning[J]. CAAI Transactions on Intelligent Systems, 2020, 15(3): 484. [doi:10.11992/tis.201812029]
[7] MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 740. [doi:10.11992/tis.201910039]
[8] LIU Dongjingdian, MENG Xuechun, ZHANG Zixin, et al. A behavioral recognition algorithm based on 2D spatiotemporal information extraction[J]. CAAI Transactions on Intelligent Systems, 2020, 15(5): 900. [doi:10.11992/tis.201906054]
[9] BAO Weike, YUAN Chun. Recommendation system with long-term and short-term sequential self-attention network[J]. CAAI Transactions on Intelligent Systems, 2021, 16(2): 353. [doi:10.11992/tis.202005028]
[10] HONG Kailin, CAO Jiangtao, JI Xiaofei. Indoor window detection of autonomous spraying robot based on improved CenterNet network[J]. CAAI Transactions on Intelligent Systems, 2021, 16(3): 425. [doi:10.11992/tis.202005016]
[11] TIAN Feng, WEI Ningbin, LIU Fang, et al. Oilfield operation behavior recognition based on spatio-temporal and action adaptive fusion network[J]. CAAI Transactions on Intelligent Systems, 2024, 19(6): 1407. [doi:10.11992/tis.202309021]

Memo

Received: 2025-03-03.
Funding: Chongqing Support Program for Returned Overseas Students' Entrepreneurship and Innovation (CX2024086); Chengdu Key R&D Support Program, Regional Science and Technology Innovation Cooperation Project (2023-YF11-00015-HZ).
About the authors: LIU Xin, associate professor, Ph.D.; research interests: machine learning, data analysis, action recognition, and image processing. E-mail: liuxin@cqupt.edu.cn. ZENG Kui, master's student; research interest: video action recognition. E-mail: a13350326994@163.com. CHEN Feng, lecturer, Ph.D.; research interest: intelligent data analysis. E-mail: chenfeng@cqupt.edu.cn.
Corresponding author: CHEN Feng. E-mail: chenfeng@cqupt.edu.cn.

Copyright © Editorial Office of CAAI Transactions on Intelligent Systems
Address: Building 145-1, Nantong Street, Nangang District, Harbin 150001, Heilongjiang Province. Tel: 0451-82534001, 82518134. E-mail: tis@vip.sina.com