
[1] ZHANG Mingquan, ZHANG Zeen, CAO Jingang, et al. Text detection method combining Segformer with an enhanced feature pyramid[J]. CAAI Transactions on Intelligent Systems, 2024, 19(5): 1111-1125. [doi:10.11992/tis.202301013]


Text detection method combining Segformer with an enhanced feature pyramid


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 19 Issue: 5 (2024) Pages: 1111-1125 Column: Academic Papers - Machine Learning Publication date: 2024-09-05

Title: Text detection method combining Segformer with an enhanced feature pyramid

Author(s): ZHANG Mingquan¹,²; ZHANG Zeen¹,²; CAO Jingang¹,²; SHAO Xuqiang¹,²
1. School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China
2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding 071003, China

Keywords: text detection; enhanced feature pyramid; attention mechanism; Segformer; ghost convolution; multiscale feature fusion; average pooling; max pooling

CLC: TP391.4

DOI: 10.11992/tis.202301013

Abstract: To address small-scale text omission, misdetection of text-like pixels, and inaccurate edge localization in text detection algorithms for natural scenes, we propose a text detection model based on Segformer and an enhanced feature pyramid. First, the model employs an MiT-B2-based encoder to generate multiscale feature maps. Next, during the upsampling phase of the decoder, a cascaded fusion attention module acquires global channel information and text features through global average pooling, global max pooling, and ghost convolution. Then, a two-level orthogonal fusion attention module uses asymmetric convolution to enhance the information in the feature fusion section horizontally and vertically. Finally, the results are post-processed using differentiable binarization. Experiments were conducted on the ICDAR2015, ShopSign1265, and MTWI datasets. Compared with eight other methods, the proposed method achieved the highest F-values, reaching 87.8%, 59.1%, and 74.8%, respectively. These results demonstrate that the method effectively improves the accuracy of text detection.
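
Two of the building blocks named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `global_pool` and `differentiable_binarization` are hypothetical, the feature maps are plain nested lists, and the steepness factor `k = 50` is an assumption taken from the original differentiable binarization work rather than from this page. The first function computes the per-channel global average and global max descriptors that a channel attention module would consume; the second applies the approximate step function B = 1 / (1 + exp(-k (P - T))) to a probability map P and a threshold map T.

```python
import math


def global_pool(feature_maps):
    """Return (averages, maxima), one value per channel.

    feature_maps is a list of channels, each a list of rows of floats.
    """
    averages, maxima = [], []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        averages.append(sum(values) / len(values))
        maxima.append(max(values))
    return averages, maxima


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Soft binarization: B = 1 / (1 + exp(-k * (P - T))).

    k = 50 is an assumed default, not a value stated on this page.
    Inputs are expected to lie in [0, 1], so exp() cannot overflow.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# One 2x2 channel: pooled descriptors summarize the whole map.
avg, mx = global_pool([[[1.0, 3.0], [5.0, 7.0]]])

# Pixels well above the threshold approach 1, pixels below approach 0.
binary = differentiable_binarization([[0.9, 0.3, 0.5]], [[0.5, 0.5, 0.5]])
```

Because the soft step is smooth, the threshold map can be learned jointly with the probability map during training, which a hard threshold would not allow.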



Last Update: 2024-09-05
