ZHANG Mingquan,ZHANG Zeen,CAO Jingang,et al.Text detection method combining Segformer with an enhanced feature pyramid[J].CAAI Transactions on Intelligent Systems,2024,19(5):1111-1125.[doi:10.11992/tis.202301013]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
19
Issue:
2024, No. 5
Pages:
1111-1125
Column:
Academic Papers: Machine Learning
Publication date:
2024-09-05
- Title:
-
Text detection method combining Segformer with an enhanced feature pyramid
- Author(s):
-
ZHANG Mingquan1,2, ZHANG Zeen1,2, CAO Jingang1,2, SHAO Xuqiang1,2
-
1. School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China;
2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, North China Electric Power University, Baoding 071003, China
-
- Keywords:
-
text detection; enhanced feature pyramid; attention mechanism; Segformer; ghost convolution; multiscale feature fusion; average pooling; max pooling
- CLC number:
-
TP391.4
- DOI:
-
10.11992/tis.202301013
- Online date:
-
2024-08-28
- Abstract:
-
To address the issues of small-scale text omission, text-like pixel misdetection, and inaccurate edge localization in natural-scene text detection algorithms, we propose a text detection model based on Segformer and an enhanced feature pyramid. First, the model employs an encoder based on the mix Transformer (MiT-B2) to generate multiscale feature maps. Then, in the upsampling stage of the feature-pyramid decoder, a cascaded fusion attention module is introduced, which acquires global channel information and preserves text features through global average pooling, global max pooling, and ghost convolution. Next, a two-level orthogonal fusion attention module is proposed in the feature-fusion stage of the decoder, using asymmetric convolutions to enhance information in the horizontal and vertical directions. Finally, the results are post-processed with differentiable binarization. Experiments were conducted on the ICDAR2015, ShopSign1265, and MTWI datasets. Compared with eight other methods, the proposed method achieved the highest F-measures, reaching 87.8%, 59.1%, and 74.8%, respectively. These results demonstrate that the method effectively improves the accuracy of text detection.
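The differentiable-binarization post-processing mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration of the approximate binarization function popularized by the DBNet line of work, B = 1 / (1 + exp(-k(P - T))), assuming a probability map P, a learned threshold map T, and the commonly used amplification factor k = 50; the paper's actual implementation details may differ.

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate (differentiable) binarization:
    B = 1 / (1 + exp(-k * (P - T))).
    prob_map and thresh_map are arrays of the same shape with values in [0, 1].
    A large k makes the sigmoid a steep, nearly hard threshold while keeping
    the operation differentiable for end-to-end training.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Toy example: a 2x2 probability map against a uniform threshold of 0.3.
P = np.array([[0.90, 0.20],
              [0.31, 0.29]])
T = np.full_like(P, 0.3)
B = differentiable_binarization(P, T)
# Pixels well above the threshold saturate toward 1, those well below toward 0,
# while values near the threshold stay soft, which keeps gradients usable.
```

At inference time, the soft map B can simply be thresholded at 0.5 to obtain the binary text mask from which boxes are extracted.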
Last update:
2024-09-05