[1] CAO Jingang, ZHANG Zeen, ZHANG Mingquan. A lightweight end-to-end text recognition method based on SPTS[J]. CAAI Transactions on Intelligent Systems, 2024, 19(6): 1503-1517. [doi:10.11992/tis.202307012]

A lightweight end-to-end text recognition method based on SPTS

CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 19 Issue: 2024, No. 6 Pages: 1503-1517 Column: Academic Papers - Fundamentals of Artificial Intelligence Publication date: 2024-12-05

Title:: A lightweight end-to-end text recognition method based on SPTS

Author(s):: CAO Jingang¹,²; ZHANG Zeen¹,²; ZHANG Mingquan¹,²
1. School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China;
2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding 071003, China

Keywords:: attention module; autoregressive decoder; lightweight network; single point position; text spotting; end to end; encoder; decoder

CLC:: TP391

DOI:: 10.11992/tis.202307012

Abstract:: To address the slow inference speed and large parameter counts of existing text spotting methods, this paper presents a lightweight end-to-end text spotting method based on single-point scene text spotting (SPTS). First, PP-LCNet is introduced as the backbone network for feature extraction. Then, a three-local channel attention module is placed before the decoder; it uses one-dimensional convolutions at three different scales to enhance information interaction between channels. Next, a locally enhanced attention module replaces the feedforward network component of the original decoder, using depthwise separable convolution to improve the spatial correlation of text features. A token selector module is then added after each decoder layer to highlight text features via salient tokens and to reduce the accumulation of irrelevant pixels. Finally, recognition results are predicted by autoregressive decoding. The proposed method was tested on three datasets (Total-Text, CTW1500, and ICDAR2015) and compared with six advanced methods (ABCNet, MANGO, ABCNet v2, SPTS, SwinTextSpotter, and TESTR). Compared with SPTS, the proposed method increased inference speed by 19.6, 35.7, and 21.1 frames/s on the three datasets, respectively, and reduced the number of parameters by 70.7%, demonstrating its effectiveness.
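
The channel attention step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not state the kernel sizes, how the three scales are fused, or the gating function, so the sizes 3, 5, and 7, the averaging fusion, the sigmoid gate, the fixed (non-learned) kernel weights, and all function names below are assumptions made for illustration only.

```python
import math


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    pad = len(kernel) // 2  # assumes an odd kernel length
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multi_scale_channel_attention(features, kernels):
    """Reweight channels using 1-D convolutions applied across the channel axis.

    features: list of C channels, each a flat list of spatial values.
    kernels:  list of odd-length 1-D kernels, one per scale (hypothetical weights).
    Returns (reweighted features, per-channel weights).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(channel) / len(channel) for channel in features]
    # Local cross-channel interaction, once per kernel scale.
    responses = [conv1d_same(pooled, kernel) for kernel in kernels]
    # Fusion by averaging the scales (an assumption, not taken from the paper).
    fused = [sum(values) / len(kernels) for values in zip(*responses)]
    # Gate each channel with a weight in (0, 1).
    weights = [sigmoid(value) for value in fused]
    scaled = [[w * v for v in channel] for w, channel in zip(weights, features)]
    return scaled, weights


# Placeholder kernels: uniform averaging at three assumed scales.
kernels = [[1.0 / size] * size for size in (3, 5, 7)]
# Toy input: 4 channels with 2 spatial positions each.
features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0], [-1.0, 1.0]]
scaled, weights = multi_scale_channel_attention(features, kernels)
```

In a trained model the kernel weights would be learned rather than fixed; the sketch only shows how convolving over the channel axis lets each channel's weight depend on its neighbouring channels at several ranges.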

Memo

Last Update: 2024-11-05
