[1]ZHANG Shaole,LEI Tao,WANG Yingbo,et al.A crowd counting network based on multi-scale pyramid Transformer[J].CAAI Transactions on Intelligent Systems,2024,19(1):67-78.[doi:10.11992/tis.202304044]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 19
Issue: 2024 (1)
Pages: 67-78
Column: Academic Papers: Machine Learning
Publication date: 2024-01-05
- Title:
A crowd counting network based on multi-scale pyramid Transformer
- Author(s):
ZHANG Shaole1; LEI Tao2,3; WANG Yingbo2; ZHOU Qiang1; XUE Mingyuan2; ZHAO Weiqiang4
1. School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi’an 710021, China;
2. School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China;
3. Shaanxi Joint Laboratory of Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China;
4. China Electronics Technology Group Corporation Northwest Group Corporation Xi’an Branch, Xi’an 710065, China
- Keywords:
dense crowd; crowd counting; multi-scale; pyramid; Transformer; self-attention; density map; deep supervision
- CLC:
TP391.41
- DOI:
10.11992/tis.202304044
- Abstract:
To address the low counting accuracy in dense crowd scenes caused by complex backgrounds and large variations in target scale, a crowd counting network based on a multi-scale pyramid Transformer (MSPT-Net) is proposed. First, in the feature extraction stage, a pyramid Transformer backbone built on depthwise separable self-attention is designed to capture both local and global image information, which mitigates the accuracy loss caused by complex backgrounds. Second, a feature pyramid fusion module and a regression head with multi-scale receptive fields are designed to integrate shallow detail features with deep semantic features in dense crowd scenes, enhancing the network’s ability to capture targets of different scales. Finally, the model is trained with deep supervision and validated on three publicly available datasets. The experimental results show that MSPT-Net achieves higher counting accuracy than mainstream crowd counting networks under both fully supervised and weakly supervised learning strategies, while keeping the number of parameters and the computational cost comparatively small.
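
For readers unfamiliar with the "density map" keyword, the following minimal sketch illustrates the general idea behind density-map-based crowd counting: each annotated head contributes a unit-mass blob to the map, so summing the map recovers the count. It is not the MSPT-Net implementation; the function names, the Gaussian kernel, and the value of `sigma` are illustrative assumptions that do not come from the paper.

```python
import math


def density_map(points, height, width, sigma=1.5):
    """Build a density map from head annotations given as (row, col) pairs.

    Assumption for illustration: a fixed-width Gaussian per head,
    normalized over the image so each head contributes exactly 1.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for r0, c0 in points:
        # Unnormalized Gaussian weights centred on the annotated head.
        weights = [
            [math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize so this head adds a total mass of 1 to the map.
        for r in range(height):
            for c in range(width):
                dmap[r][c] += weights[r][c] / total
    return dmap


def count_from_density(dmap):
    """The estimated crowd count is the sum over the density map."""
    return sum(map(sum, dmap))


heads = [(2, 3), (5, 5), (7, 1)]  # hypothetical annotations
dmap = density_map(heads, height=10, width=10)
estimated = count_from_density(dmap)
```

In a counting network, a regression head predicts such a map from image features, and the predicted count is obtained by the same summation.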