
[1]张铭泉,张泽恩,曹锦纲,等.结合Segformer与增强特征金字塔的文本检测方法[J].智能系统学报,2024,19(5):1111-1125.[doi:10.11992/tis.202301013]
　ZHANG Mingquan,ZHANG Zeen,CAO Jingang,et al.Text detection method combining Segformer with an enhanced feature pyramid[J].CAAI Transactions on Intelligent Systems,2024,19(5):1111-1125.[doi:10.11992/tis.202301013]


结合Segformer与增强特征金字塔的文本检测方法


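《智能系统学报》 (CAAI Transactions on Intelligent Systems) [ISSN 1673-4785/CN 23-1538/TP] Volume: 19; Issue: 2024, No. 5; Pages: 1111-1125; Column: Academic Papers, Machine Learning; Publication date: 2024-09-05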

Title:: Text detection method combining Segformer with an enhanced feature pyramid

作者:: 张铭泉^1,2, 张泽恩^1,2, 曹锦纲^1,2, 邵绪强^1,2; 1. 华北电力大学控制与计算机工程学院, 河北保定 071003;
2. 华北电力大学复杂能源系统智能计算教育部工程研究中心, 河北保定 071003

Author(s):: ZHANG Mingquan^1,2, ZHANG Zeen^1,2, CAO Jingang^1,2, SHAO Xuqiang^1,2; 1. School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China;
2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, North China Electric Power University, Baoding 071003, China

关键词:: 文本检测; 特征金字塔; 注意力机制; Segformer; Ghost模块; 多尺度特征融合; 平均池化; 最大池化

Keywords:: text detection; enhanced feature pyramid; attention mechanism; Segformer; ghost convolution; multiscale feature fusion; average pooling; max pooling

分类号:: TP391.4

DOI:: 10.11992/tis.202301013

文献标志码:: 2024-08-28

摘要:: 针对自然场景文本检测算法中的小尺度文本漏检、类文本像素误检以及边缘定位不准确的问题，提出一种基于Segformer和增强特征金字塔的文本检测模型。该模型首先采用基于混合Transformer (mix Transformer, MiT)的编码器生成多尺度特征图；然后，在具有特征金字塔结构解码器的上采样部分，提出级联融合注意力模块，通过全局平均池化、全局最大池化和Ghost模块获取全局通道信息并保留文本特征；接着，在解码器的特征融合部分提出两级正交融合注意力模块，利用非对称卷积分别从水平和垂直方向进行信息增强；最后，利用可微分二值化对结果进行后处理。将本文方法在ICDAR2015、ShopSign1265和MTWI 3个数据集上进行实验，相比于其他8种方法，本文方法的F值均为最优，分别达到了87.8%、59.1%和74.8%。结果表明，本文方法有效提高了文本检测的准确率。

Abstract:: To address missed detection of small-scale text, false detection of text-like pixels, and inaccurate edge localization in natural scene text detection, we propose a text detection model based on Segformer and an enhanced feature pyramid. First, the model employs a mix Transformer (MiT-B2) encoder to generate multiscale feature maps. Next, in the upsampling part of the feature-pyramid decoder, a cascaded fusion attention module uses global average pooling, global max pooling, and the Ghost module to capture global channel information while preserving text features. Then, in the feature fusion part of the decoder, a two-level orthogonal fusion attention module uses asymmetric convolution to enhance information along the horizontal and vertical directions. Finally, differentiable binarization is used to post-process the results. Experiments were conducted on the ICDAR2015, ShopSign1265, and MTWI datasets. Compared with eight other methods, the proposed method achieved the highest F-value on each dataset, reaching 87.8%, 59.1%, and 74.8%, respectively. These results indicate that the method effectively improves the accuracy of text detection.
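
Three of the building blocks named in the abstract can be illustrated with a small sketch. The Python below is not the authors' implementation. It only shows the approximate step function commonly given for differentiable binarization, B = 1 / (1 + exp(-k(P - T))), the F-value as the harmonic mean of precision and recall, and the per-channel global average and global max pooling that the cascaded fusion attention module is described as using. The function names, the nested-list data layout, and the default k = 50 are assumptions made for illustration; the abstract does not state them.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map, computed per pixel as 1 / (1 + exp(-k * (P - T))).

    prob_map and thresh_map are 2D lists of floats in [0, 1] with equal shape.
    k controls how sharply the output switches from 0 to 1 (assumed value).
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


def f_measure(precision, recall):
    """F-value: harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def global_pool_descriptors(feature_map):
    """Global average and global max pooling over each channel.

    feature_map is a list of channels, each channel a non-empty 2D list.
    Returns two lists (averages, maxima) with one value per channel.
    """
    averages, maxima = [], []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        averages.append(sum(values) / len(values))
        maxima.append(max(values))
    return averages, maxima


if __name__ == "__main__":
    prob = [[0.9, 0.1], [0.5, 0.3]]
    thresh = [[0.3, 0.3], [0.5, 0.3]]
    print(differentiable_binarization(prob, thresh))
    print(f_measure(0.9, 0.8))
    print(global_pool_descriptors([[[1.0, 2.0], [3.0, 4.0]]]))
```

Pixels whose probability is well above the local threshold map to values close to 1, pixels well below map to values close to 0, and a pixel exactly at the threshold maps to 0.5. In a real model the two pooled descriptors would feed further layers (here, reportedly a Ghost module) to produce channel weights; that part is not sketched because the page does not describe it.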

参考文献/References:: [1] 朱志颖. 基于深度学习的街景文本检测与识别研究[D]. 南京: 南京邮电大学, 2023.
ZHU Zhiying. Research on street view text detection and recognition based on deep learning[D]. Nanjing: Nanjing University of Posts and Telecommunications, 2023.
[2] 周燕, 韦勤彬, 廖俊玮, 等. 自然场景文本检测与端到端识别: 深度学习方法[J]. 计算机科学与探索, 2023, 17(3): 577-594.
ZHOU Yan, WEI Qinbin, LIAO Junwei, et al. Natural scene text detection and end-to-end recognition: deep learning methods[J]. Journal of frontiers of computer science and technology, 2023, 17(3): 577-594.
[3] 李祥鹏, 闵卫东, 韩清, 等. 基于深度学习的车牌定位和识别方法[J]. 计算机辅助设计与图形学学报, 2019, 31(6): 979-987.
LI Xiangpeng, MIN Weidong, HAN Qing, et al. License plate location and recognition based on deep learning[J]. Journal of computer-aided design & computer graphics, 2019, 31(6): 979-987.
[4] 刘光辉, 张钰敏, 孟月波, 等. 双分支跨级特征融合的自然场景文本检测[J]. 智能系统学报, 2023, 18(5): 1079-1089.
LIU Guanghui, ZHANG Yumin, MENG Yuebo, et al. Natural scene text detection based on double-branch cross-level feature fusion[J]. CAAI transactions on intelligent systems, 2023, 18(5): 1079-1089.
[5] 王润民, 桑农, 丁丁, 等. 自然场景图像中的文本检测综述[J]. 自动化学报, 2018, 44(12): 2113-2141.
WANG Runmin, SANG Nong, DING Ding, et al. Text detection in natural scene image: a survey[J]. Acta automatica sinica, 2018, 44(12): 2113-2141.
[6] LIU Wei, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]//European conference on computer vision. Cham: Springer, 2016: 21-37.
[7] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(6): 1137-1149.
[8] JIANG Yingying, ZHU Xiangyu, WANG Xiaobing, et al. R2CNN: rotational region CNN for orientation robust scene text detection[EB/OL]. (2017-06-29)[2023-01-11]. https://arxiv.org/abs/1706.09579.
[9] LIAO Minghui, SHI Baoguang, BAI Xiang, et al. TextBoxes: a fast text detector with a single deep neural network[C]//Proceedings of the AAAI conference on artificial intelligence. San Francisco: AAAI, 2017: 4161-4167.
[10] LIAO Minghui, SHI Baoguang, BAI Xiang. TextBoxes++: a single-shot oriented scene text detector[J]. IEEE transactions on image processing, 2018, 27(8): 3676-3690.
[11] HE Tong, HUANG Weilin, QIAO Yu, et al. Accurate text localization in natural image with cascaded convolutional text network[EB/OL]. (2016-03-31)[2023-01-11]. https://arxiv.org/abs/1603.09423.
[12] LI Yi, WU Zhe, ZHAO Shuang, et al. PSENet: psoriasis severity evaluation network[C]//Proceedings of the AAAI conference on artificial intelligence. Palo Alto: AAAI, 2020: 800-807.
[13] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[14] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944.
[15] WANG Wenhai, XIE Enze, SONG Xiaoge, et al. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network[C]//2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 8439-8448.
[16] LIAO Minghui, WAN Zhaoyi, YAO Cong, et al. Real-time scene text detection with differentiable binarization[C]//Proceedings of the AAAI conference on artificial intelligence. Palo Alto: AAAI, 2020: 11474-11481.
[17] 邵海琳, 季怡, 刘纯平, 等. 基于增强特征金字塔网络的场景文本检测算法[J]. 计算机科学, 2022, 49(2): 248-255.
SHAO Hailin, JI Yi, LIU Chunping, et al. Scene text detection algorithm based on enhanced feature pyramid network[J]. Computer science, 2022, 49(2): 248-255.
[18] 雷小唐, 胡靖. 文本中心像素重建实现任意形状的文本检测[J]. 计算机工程与应用, 2023, 59(8): 148-156.
LEI Xiaotang, HU Jing. Text center pixel reconstruction to achieve efficient arbitrary shape text detection[J]. Computer engineering and applications, 2023, 59(8): 148-156.
[19] 梁浩然, 叶凌晨, 梁荣华, 等. 注意力监督策略下的自然场景文本检测算法[J]. 计算机辅助设计与图形学学报, 2022, 34(7): 1011-1019.
LIANG Haoran, YE Lingchen, LIANG Ronghua, et al. Text detection algorithm for natural scenes under attention supervision strategy[J]. Journal of computer-aided design & computer graphics, 2022, 34(7): 1011-1019.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. (2020-10-22)[2023-01-11]. https://arxiv.org/abs/2010.11929.
[21] CHU Xiangxiang, TIAN Zhi, ZHANG Bo, et al. Conditional positional encodings for vision transformers[EB/OL]. (2021-02-22)[2023-01-11]. https://arxiv.org/abs/2102.10882.
[22] HAN Kai, XIAO An, WU Enhua, et al. Transformer in transformer[J]. Advances in neural information processing systems, 2021, 34: 15908-15919.
[23] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 9992-10002.
[24] WANG Wenhai, XIE Enze, LI Xiang, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 548-558.
[25] XIE Enze, WANG Wenhai, YU Zhiding, et al. SegFormer: simple and efficient design for semantic segmentation with transformers[J]. Advances in neural information processing systems, 2021, 34: 12077-12090.
[26] HAN Kai, WANG Yunhe, TIAN Qi, et al. GhostNet: more features from cheap operations[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 1577-1586.
[27] KARATZAS D, GOMEZ-BIGORDA L, NICOLAOU A, et al. ICDAR 2015 competition on Robust Reading[C]//2015 13th International Conference on Document Analysis and Recognition. Tunis: IEEE, 2015: 1156-1160.
[28] HE Mengchao, LIU Yuliang, YANG Zhibo, et al. ICPR2018 contest on robust reading for multi-type web images[C]//2018 24th International Conference on Pattern Recognition. Beijing: IEEE, 2018: 7-12.
[29] ZHANG Chongsheng, PENG Guowen, TAO Yuefeng, et al. ShopSign: a diverse scene text dataset of Chinese shop signs in street views[EB/OL]. (2019-03-25)[2023-01-11]. https://arxiv.org/abs/1903.10412.
[30] LONG Shangbang, RUAN Jiaqiang, ZHANG Wenjie, et al. TextSnake: a flexible representation for detecting text of arbitrary shapes[C]//European conference on computer vision. Cham: Springer, 2018: 19-35.
[31] WANG Yuxin, XIE Hongtao, ZHA Zhengjun, et al. ContourNet: taking a further step toward accurate arbitrary-shaped scene text detection[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11750-11759.
[32] ZHANG Shixue, ZHU Xiaobin, HOU Jiebo, et al. Deep relational reasoning graph network for arbitrary shape text detection[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 9696-9705.
[33] ZHU Yiqin, CHEN Jianyong, LIANG Lingyu, et al. Fourier contour embedding for arbitrary-shaped text detection[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3122-3130.
[34] LIAO Minghui, ZOU Zhisheng, WAN Zhaoyi, et al. Real-time scene text detection with differentiable binarization and adaptive scale fusion[J]. IEEE transactions on pattern analysis and machine intelligence, 2023, 45(1): 919-931.
[35] LIU Jinpeng, WU Song, HE Dehong, et al. MS-ROCANet: multi-scale residual orthogonal-channel attention network for scene text detection[C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 2200-2204.
[36] MA Ningning, ZHANG Xiangyu, ZHENG Haitao, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//European conference on computer vision. Cham: Springer, 2018: 122-138.
[37] SANDLER M, HOWARD A, ZHU Menglong, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4510-4520.
[38] ZHANG Hang, WU Chongruo, ZHANG Zhongyue, et al. ResNeSt: split-attention networks[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New Orleans: IEEE, 2022: 2735-2745.
[39] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 11966-11976.

相似文献/Similar Articles:: [1]黄剑华,唐降龙,刘家锋,等.一种基于Homogeneity的文本检测新方法[J].智能系统学报,2007,2(1):69.
　HUANG Jian-hua,TANG Xiang-long,LIU Jia-feng,et al.A new method for text detection based on Homogeneity[J].CAAI Transactions on Intelligent Systems,2007,2(1):69.
[2]赵文清,孔子旭,赵振兵.隔级融合特征金字塔与CornerNet相结合的小目标检测[J].智能系统学报,2021,16(1):108.[doi:10.11992/tis.202004033]
　ZHAO Wenqing,KONG Zixu,ZHAO Zhenbing.Small target detection based on a combination of feature pyramid and CornerNet[J].CAAI Transactions on Intelligent Systems,2021,16(1):108.[doi:10.11992/tis.202004033]
[3]赵文清,杨盼盼.双向特征融合与注意力机制结合的目标检测[J].智能系统学报,2021,16(6):1098.[doi:10.11992/tis.202012029]
　ZHAO Wenqing,YANG Panpan.Target detection based on bidirectional feature fusion and an attention mechanism[J].CAAI Transactions on Intelligent Systems,2021,16(6):1098.[doi:10.11992/tis.202012029]
[4]刘光辉,张钰敏,孟月波,等.双分支跨级特征融合的自然场景文本检测[J].智能系统学报,2023,18(5):1079.[doi:10.11992/tis.202303005]
　LIU Guanghui,ZHANG Yumin,MENG Yuebo,et al.Natural scene text detection based on double-branch cross-level feature fusion[J].CAAI Transactions on Intelligent Systems,2023,18(5):1079.[doi:10.11992/tis.202303005]
[5]曲海成,李瑞柯,王蒙,等.基于特征重用和膨胀卷积的遥感图像舰船检测[J].智能系统学报,2024,19(5):1298.[doi:10.11992/tis.202304002]
　QU Haicheng,LI Ruike,WANG Meng,et al.Ship detection in remote sensing images via feature reuse and dilated convolution[J].CAAI Transactions on Intelligent Systems,2024,19(5):1298.[doi:10.11992/tis.202304002]

备注/Memo

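Received date: 2023-01-11.
Funding: Fundamental Research Funds for the Central Universities (2021MS092); Hebei Provincial Science and Technology Program (22310302D).
Author biographies: ZHANG Mingquan, associate professor. Main research interests: computer organization, machine learning, and pattern recognition. Has published more than 20 academic papers. E-mail: mqzhang@ncepu.edu.cn; ZHANG Zeen, master's student. Main research interests: deep learning and text detection. E-mail: zze15832206526@163.com; CAO Jingang, lecturer. Main research interests: image processing and pattern recognition. Has published more than 10 academic papers. E-mail: caojg168@126.com.
Corresponding author: CAO Jingang. E-mail: caojg168@126.com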

更新日期/Last Update: 2024-09-05
