[1] SHAO Kai, WANG Mingzheng, WANG Guangyu. Transformer-based multiscale remote sensing semantic segmentation network[J]. CAAI Transactions on Intelligent Systems, 2024, 19(4): 920-929. [doi:10.11992/tis.202304026]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 19
Issue: 2024(4)
Pages: 920-929
Column: Academic Papers - Intelligent Systems
Publication date: 2024-07-05
Title: Transformer-based multiscale remote sensing semantic segmentation network
Author(s): SHAO Kai1,2,3; WANG Mingzheng1; WANG Guangyu1,2
Affiliation(s):
1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
2. Chongqing Key Laboratory of Mobile Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
3. Engineering Research Center of Mobile Communications of the Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Keywords: remote sensing image; semantic segmentation; convolutional neural network; Transformer; global contextual information; multiscale receptive field; encoder; decoder
CLC: TP391
DOI: 10.11992/tis.202304026
Abstract:
To improve the semantic segmentation of remote sensing images, this paper proposes a multiscale Transformer network (MSTNet) built around two key points, global contextual information and multiscale semantic features, motivated by the small inter-class variance and large intra-class variance of segmentation targets. MSTNet consists of an encoder and a decoder. The encoder comprises an improved Transformer-based visual attention network (VAN) backbone and a multiscale semantic feature extraction module (MSFEM), improved from atrous spatial pyramid pooling (ASPP), to extract multiscale semantic features. The decoder is designed with a lightweight multilayer perceptron (MLP) that works with the encoder to fully exploit the global contextual information and multiscale feature representations extracted by utilizing the inductive property of the Transformer. The proposed MSTNet was validated on two high-resolution remote sensing semantic segmentation datasets, ISPRS Potsdam and LoveDA, achieving a mean intersection over union (mIoU) of 79.50% and 54.12% and a mean F1-score (mF1) of 87.46% and 69.34%, respectively. The experimental results verify that the proposed method effectively improves the semantic segmentation of remote sensing images.
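
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two components it names on the feature side: an ASPP-style multiscale feature extraction module and a lightweight MLP decoder that fuses multi-stage features. This is a hedged illustration, not the authors' implementation: the class names (MSFEM, MLPDecoder), the dilation rates, the channel widths, and the random stand-in for the VAN backbone features are all assumptions made for this sketch.

# Minimal sketch of an MSTNet-style segmentation head; NOT the paper's code.
# The VAN backbone is replaced by randomly generated multi-stage features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFEM(nn.Module):
    """ASPP-style multiscale feature extraction (dilation rates assumed)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # One dilated 3x3 conv branch per rate; padding=rate keeps H and W.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ) for r in rates
        ])
        # 1x1 conv fuses the concatenated branches back to out_ch channels.
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class MLPDecoder(nn.Module):
    """Lightweight all-MLP decoder: project each stage with a 1x1 conv,
    upsample to the shallowest stage, concatenate, fuse, and classify."""
    def __init__(self, in_chs, embed_dim, num_classes):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_chs])
        self.fuse = nn.Conv2d(embed_dim * len(in_chs), embed_dim, 1)
        self.head = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]  # resolution of the shallowest stage
        ups = [F.interpolate(p(f), size=size, mode="bilinear",
                             align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

if __name__ == "__main__":
    # Stand-in for a four-stage backbone output (shapes are assumptions).
    feats = [torch.randn(1, c, 64 // s, 64 // s)
             for c, s in zip((64, 128, 256, 512), (1, 2, 4, 8))]
    feats[-1] = MSFEM(512, 512)(feats[-1])  # refine the deepest stage
    out = MLPDecoder((64, 128, 256, 512), embed_dim=256, num_classes=6)(feats)
    print(out.shape)  # torch.Size([1, 6, 64, 64])

Run as a script, the sketch prints per-pixel logits for six classes (the size of the ISPRS Potsdam label set) at the resolution of the shallowest feature stage; a real pipeline would upsample these logits to the input image size before computing the loss.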