[1] ZHANG Shaole, LEI Tao, WANG Yingbo, et al. A crowd counting network based on multi-scale pyramid Transformer[J]. CAAI Transactions on Intelligent Systems, 2024, 19(1): 67-78. [doi:10.11992/tis.202304044]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
19
Issue:
2024, No. 1
Pages:
67-78
Column:
Research Papers: Machine Learning
Publication date:
2024-01-05
- Title:
-
A crowd counting network based on multi-scale pyramid Transformer
- Author(s):
-
ZHANG Shaole1, LEI Tao2,3, WANG Yingbo2, ZHOU Qiang1, XUE Mingyuan2, ZHAO Weiqiang4
-
1. School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi’an 710021, China;
2. School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China;
3. Shaanxi Joint Laboratory of Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China;
4. China Electronics Technology Group Corporation Northwest Group Corporation Xi’an Branch, Xi’an 710065, China
-
- Keywords:
-
dense crowd; crowd counting; multi-scale; pyramid; Transformer; self-attention; density map; deep supervision
- CLC number:
-
TP391.41
- DOI:
-
10.11992/tis.202304044
- Document code:
-
2024-01-03
- Abstract:
-
A crowd counting network based on a multi-scale pyramid Transformer (MSPT-Net) is proposed to address the low counting accuracy in dense crowd scenes caused by complex backgrounds and large variations in target scale. In the feature extraction stage, a pyramid Transformer backbone based on depthwise separable self-attention is designed to capture both local and global image information, which mitigates the loss of counting accuracy caused by complex backgrounds. A feature pyramid fusion module and a regression head with multi-scale receptive fields are designed to efficiently fuse the shallow detail features and deep semantic features of dense crowd images, enhancing the network’s ability to capture targets of different scales. The model is trained with deep supervision and validated on three public datasets. The experimental results show that, under both fully supervised and weakly supervised learning strategies, MSPT-Net achieves higher counting accuracy than current mainstream crowd counting methods on dense crowd images with complex backgrounds and large variations in target scale, while requiring fewer parameters and less computation.
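As background for the density-map and deep-supervision terms used in the abstract, the following is a minimal pure-Python sketch of the general idea, not the MSPT-Net implementation. It assumes a fixed Gaussian kernel for building the ground-truth density map, a mean-squared-error loss, and side outputs that share the resolution of the target; the abstract specifies none of these, and all function names and the `sigma` and `weights` parameters are hypothetical.

```python
import math


def gaussian_density_map(points, height, width, sigma=1.5):
    """Build a density map with one Gaussian per annotated head.

    Each Gaussian is normalized over the image grid, so the whole map
    sums to the number of annotated points (assumed kernel choice).
    """
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        kernel = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


def count_from_density(density):
    """The estimated crowd count is the sum over the density map."""
    return sum(map(sum, density))


def mse(pred, target):
    """Mean squared error between two maps of equal size."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


def deep_supervision_loss(side_outputs, target, weights):
    """Weighted sum of per-output losses (hypothetical weighting scheme).

    Assumes every side output already has the target's resolution.
    """
    return sum(w * mse(out, target) for out, w in zip(side_outputs, weights))


if __name__ == "__main__":
    heads = [(2, 3), (5, 5), (0, 0)]  # (row, col) head annotations
    target = gaussian_density_map(heads, height=8, width=8)
    print(round(count_from_density(target), 6))
```

In this sketch the count is recovered by summing the map, and deep supervision amounts to adding a loss term for each intermediate prediction rather than only for the final one.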
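Memo
Received: 2023-04-30.
Funding: National Natural Science Foundation of China (62271296, 62201334); Key Research and Development Program of Shaanxi Province (2021ZDLGY08-07); Shaanxi Science Fund for Distinguished Young Scholars (2021JC-47).
About the authors: ZHANG Shaole, master's student; main research interests are computer vision and machine learning. E-mail: 210612054@sust.edu.cn; LEI Tao, professor and doctoral supervisor, vice dean of the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, and IEEE Senior Member; main research interests are computer vision and machine learning. Has led 5 National Natural Science Foundation of China projects and 6 projects under programs including the Key Research and Development Program of Shaanxi Province and the China Postdoctoral Science Foundation; holds 15 granted invention patents; received one Second Prize of the Shaanxi Provincial Science and Technology Award (Natural Science Award); has published more than 90 academic papers. E-mail: leitao@sust.edu.cn; WANG Yingbo, lecturer; main research interests are image restoration and scene perception in scattering environments. Has participated in 5 projects, including National Natural Science Foundation of China General Program projects and the Gaofen major special program; holds 8 granted invention patents and 1 registered software copyright; has published more than 20 academic papers. E-mail: wangyingbo@sust.edu.cn
Corresponding author: LEI Tao. E-mail: leitao@sust.edu.cn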