[1] CAO Jingang, ZHANG Zeen, ZHANG Mingquan. A lightweight end-to-end text recognition method based on SPTS[J]. CAAI Transactions on Intelligent Systems, 2024, 19(6): 1503-1517. [doi:10.11992/tis.202307012]
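CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
19
Issue:
No. 6, 2024
Pages:
1503-1517
Column:
Academic Papers: Fundamentals of Artificial Intelligence
Publication date:
2024-12-05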
- Title:
-
A lightweight end-to-end text recognition method based on SPTS
- Author(s):
-
CAO Jingang1,2, ZHANG Zeen1,2, ZHANG Mingquan1,2
-
1. School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China;
2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, North China Electric Power University, Baoding 071003, China
-
- Keywords:
-
attention module; autoregressive decoding; lightweight network; single-point positioning; text spotting; end-to-end; encoder; decoder
- Classification number (CLC):
-
TP391
- DOI:
-
10.11992/tis.202307012
- Abstract:
-
To address the slow inference speed and large parameter counts of existing text spotting methods, this paper presents a lightweight end-to-end text spotting method that improves on the single-point scene text spotting (SPTS) model. First, PP-LCNet is introduced as the backbone network for feature extraction. Second, a three-local channel attention module is placed before the decoder; it applies one-dimensional convolutions at three different scales to enhance information interaction between channels. Third, a locally enhanced attention module replaces the feedforward network in the original decoder, using depthwise separable convolution to strengthen the spatial correlation of text features. Fourth, a token selector module is added after each decoder layer; it highlights text features with saliency markers and reduces the accumulation of irrelevant pixels. Finally, the recognition results are predicted by autoregressive decoding. The method is evaluated on the Total-Text, CTW1500, and ICDAR2015 datasets and compared with six advanced methods (ABCNet, MANGO, ABCNet v2, SPTS, SwinTextSpotter, and TESTR). Compared with SPTS, the proposed method increases inference speed by 19.6, 35.7, and 21.1 frames/s on the three datasets, respectively, and reduces the number of parameters by 70.7%, demonstrating its effectiveness.
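The channel attention step in the abstract (one-dimensional convolutions at three scales applied across channels) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the kernel sizes (3, 5, 7), the uniform kernels standing in for learned weights, the averaging used to fuse the three branches, and the function names are all assumptions made for illustration.

```python
import math


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multi_scale_channel_attention(feature_map, kernel_sizes=(3, 5, 7)):
    """Reweight channels of a feature map given as C channels of H x W floats.

    Returns (reweighted feature map, per-channel weights).
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    descriptors = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # One 1-D convolution per scale, slid along the channel axis.
    # Uniform kernels stand in for learned weights (assumption).
    branches = [conv1d_same(descriptors, [1.0 / k] * k) for k in kernel_sizes]
    # Fuse the branches by averaging (assumption), then gate with a sigmoid.
    weights = [
        sigmoid(sum(branch[c] for branch in branches) / len(branches))
        for c in range(len(descriptors))
    ]
    # Rescale every value in a channel by that channel's weight.
    scaled = [
        [[weights[c] * value for value in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]
    return scaled, weights
```

Each branch mixes a channel's descriptor with a different number of neighbouring channels, so the fused weight reflects local cross-channel interaction at several ranges without a fully connected layer.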
Last Update:
2024-11-05