[1] ZHANG Lei, HUANG Yongqiu, LI Xin, et al. Review of weakly supervised language-guided image segmentation and grounding[J]. CAAI Transactions on Intelligent Systems, 2025, 20(6): 1304-1327. [doi:10.11992/tis.202505001]

Review of weakly supervised language-guided image segmentation and grounding

References:
[1] HU R, ROHRBACH M, DARRELL T. Segmentation from natural language expressions[C]//Computer Vision–ECCV 2016: 14th European Conference. Amsterdam: Springer, 2016: 108-124.
[2] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 11-20.
[3] YU Licheng, POIRSON P, YANG Shan, et al. Modeling context in referring expressions[C]//Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016: 69-85.
[4] HE Kaiming, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980-2988.
[5] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(6): 1137-1149.
[6] CHEN Haonan, TAN Hao, KUNTZ A, et al. Enabling robots to understand incomplete natural language instructions using commonsense reasoning[C]//2020 IEEE International Conference on Robotics and Automation. Paris: IEEE, 2020: 1963-1969.
[7] GU Jing, STEFANI E, WU Qi, et al. Vision-and-language navigation: a survey of tasks, methods, and future directions[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin: ACL, 2022: 7606-7623.
[8] HU Yutao, WANG Qixiong, SHAO Wenqi, et al. Beyond one-to-one: rethinking the referring image segmentation[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 4044-4054.
[9] YU Licheng, LIN Zhe, SHEN Xiaohui, et al. MAttNet: modular attention network for referring expression comprehension[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1307-1315.
[10] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 9489-9498.
[11] QIU Shuang, ZHAO Yao, WEI Shikui. A survey of referring image segmentation[J]. Journal of signal processing, 2022, 38(6): 1144-1154.
[12] XIANG Weikang, ZHOU Quan, CUI Jingcheng, et al. Weakly supervised semantic segmentation based on deep learning[J]. Journal of image and graphics, 2024, 29(5): 1146-1168.
[13] CHEN Zhenyuan, WANG Zhendong, GONG Chen. Image-level labeled weakly supervised object detection: a survey[J]. Journal of image and graphics, 2023, 28(9): 2644-2660.
[14] LI Wensheng, ZHANG Jing, ZHUO Li, et al. Overview of Transformer-based visual segmentation techniques[J]. Chinese journal of computers, 2024, 47(12): 2760-2782.
[15] QI Lei, YU Peize, GAO Yang. Research on weak-supervised person re-identification[J]. Journal of software, 2020, 31(9): 2883-2902.
[16] JIANG Hongyi, WANG Yongjuan, KANG Jinyu. A survey of object detection models and its optimization methods[J]. Acta automatica sinica, 2021, 47(6): 1232-1255.
[17] SHEN Wei, PENG Zelin, WANG Xuehui, et al. A survey on label-efficient deep image segmentation: bridging the gap between weak supervision and dense prediction[J]. IEEE transactions on pattern analysis and machine intelligence, 2023, 45(8): 9284-9305.
[18] LIU Fang, LIU Yuhao, KONG Yuqiu, et al. Referring image segmentation using text supervision[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 22067-22077.
[19] FENG Guang, ZHANG Lihe, HU Zhiwei, et al. Learning from box annotations for referring image segmentation[J]. IEEE transactions on neural networks and learning systems, 2022, 35(3): 3927-3937.
[20] LI Hui, SUN Mingjie, XIAO Jimin, et al. Fully and weakly supervised referring expression segmentation with end-to-end learning[J]. IEEE transactions on circuits and systems for video technology, 2023, 33(10): 5999-6012.
[21] ZANG Ying, CAO Runlong, FU Chenglong, et al. RESMatch: Referring expression segmentation in a semi-supervised manner[J]. Information sciences, 2025, 694: 121709.
[22] QU Mengxue, WU Yu, WEI Yunchao, et al. Learning to segment every referring object point by point[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 3021-3030.
[23] NAG S, GOSWAMI K, KARANAM S. SafaRi: adaptive sequence transformer for weakly supervised referring expression segmentation[C]//Computer Vision–ECCV 2024. Cham: Springer Nature Switzerland, 2024: 485-503.
[24] HUANG Minglang, ZHOU Yiyi, LUO Gen, et al. Towards omni-supervised referring expression segmentation[C]//2024 IEEE International Conference on Multimedia and Expo. Niagara Falls: IEEE, 2024: 1-6.
[25] SUN Jiamu, LUO Gen, ZHOU Yiyi, et al. RefTeacher: a strong baseline for semi-supervised referring expression comprehension[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 19144-19154.
[26] YANG D, JI J, MA Y, et al. SAM as the guide: mastering pseudo-label refinement in semi-supervised referring expression segmentation[C]//International Conference on Machine Learning. Toronto: PMLR, 2024: 56139-56155.
[27] HU Ronghang, ROHRBACH M, ANDREAS J, et al. Modeling relationships in referential expressions with compositional modular networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 4418-4427.
[28] ZHU Haidong, SADHU A, ZHENG Zhaoheng, et al. Utilizing every image object for semi-supervised phrase grounding[C]//2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2021: 2210-2219.
[29] ZHANG Dingwen, HAN Junwei, CHENG Gong, et al. Weakly supervised object localization and detection: a survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2021, 44(9): 5866-5885.
[30] XIAO Fanyi, SIGAL L, LEE Y J. Weakly-supervised visual grounding of phrases with linguistic structures[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 5253-5262.
[31] DONG T. Weakly supervised learning from referring expression: Challenge and directions[D]. Urbana-Champaign: University of Illinois at Urbana-Champaign, 2018.
[32] DIETTERICH T G, LATHROP R H, LOZANO-PÉREZ T. Solving the multiple instance problem with axis-parallel rectangles[J]. Artificial intelligence, 1997, 89(1/2): 31-71.
[33] CINBIS R G, VERBEEK J, SCHMID C. Multi-fold MIL training for weakly supervised object localization[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 2409-2416.
[34] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10) [2024-12-12]. https://arxiv.org/pdf/1409.1556.
[35] GRAVES A. Long short-term memory[M/OL]. Supervised sequence labelling with recurrent neural networks. Berlin: Springer Berlin Heidelberg, 2012: 37-45. [2024-12-12]. https://doi.org/10.1007/978-3-642-24797-2_4.
[36] STRUDEL R, LAPTEV I, SCHMID C. Weakly-supervised segmentation of referring expressions[EB/OL]. (2022-05-02) [2024-12-12]. https://arxiv.org/abs/2205.04725.
[37] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: Curran Associates Inc, 2017: 6000-6010.
[38] LEE J, LEE S, NAM J, et al. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 21813-21824.
[39] KIM D, KIM N, LAN Cuiling, et al. Shatter and gather: learning referring image segmentation with text supervision[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 15501-15511.
[40] LOCATELLO F, WEISSENBORN D, UNTERTHINER T, et al. Object-centric learning with slot attention[J]. Advances in neural information processing systems, 2020, 33: 11525-11538.
[41] LI J, SELVARAJU R, GOTMARE A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in neural information processing systems, 2021, 34: 9694-9705.
[42] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[J]. International journal of computer vision, 2020, 128(2): 336-359.
[43] ARBELÁEZ P, PONT-TUSET J, BARRON J, et al. Multiscale combinatorial grouping[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 328-335.
[44] CHEN Shengxin, LUO Gen, ZHOU Yiyi, et al. QueryMatch: a query-based contrastive learning framework for weakly supervised visual grounding[C]//Proceedings of the 32nd ACM International Conference on Multimedia. Melbourne: ACM, 2024: 4177-4186.
[45] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]//Computer Vision–ECCV 2020. Cham: Springer International Publishing, 2020: 213-229.
[46] DAI Qiyuan, YANG Sibei. Curriculum point prompting for weakly-supervised referring image segmentation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 13711-13722.
[47] EIRAS F, OKSUZ K, BIBI A, et al. Segment, select, correct: a framework for weakly-supervised referring segmentation[EB/OL]. (2023-10-23) [2024-12-12]. https://arxiv.org/abs/2310.13479.
[48] KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 3992-4003.
[49] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. Cambridge: PMLR, 2021: 8748-8763.
[50] DENG Jia, DONG Wei, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248-255.
[51] LIU Shilong, ZENG Zhaoyang, REN Tianhe, et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection[C]//Computer Vision–ECCV 2024. Cham: Springer Nature Switzerland, 2024: 38-55.
[52] WANG Xinlong, YU Zhiding, DE MELLO S, et al. FreeSOLO: learning to segment objects without annotations[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 14156-14166.
[53] YANG Zaiquan, LIU Yuhao, LIN Jiaying, et al. Boosting weakly supervised referring image segmentation via progressive comprehension[C]//The Thirty-eighth Annual Conference on Neural Information Processing Systems. Vancouver: Curran Associates Inc, 2024: 93213-93239.
[54] JIANG A Q, SABLAYROLLES A, MENSCH A, et al. Mistral 7B[EB/OL]. (2023-10-10) [2024-12-12]. https://arxiv.org/abs/2310.06825.
[55] ROHRBACH A, ROHRBACH M, HU Ronghang, et al. Grounding of textual phrases in images by reconstruction[C]//Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016: 817-834.
[56] LIU Xuejing, LI Liang, WANG Shuhui, et al. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding[C]//Proceedings of the 27th ACM International Conference on Multimedia. Nice: ACM, 2019: 539-547.
[57] LIU Xuejing, LI Liang, WANG Shuhui, et al. Adaptive reconstruction network for weakly supervised referring expression grounding[C]//2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 2611-2620.
[58] ZHAO Fang, LI Jianshu, ZHAO Jian, et al. Weakly supervised phrase localization with multi-scale anchored transformer network[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 5696-5705.
[59] SUN Mingjie, XIAO Jimin, LIM E G, et al. Discriminative triad matching and reconstruction for weakly referring expression grounding[J]. IEEE transactions on pattern analysis and machine intelligence, 2021, 43(11): 4189-4195.
[60] WANG Ning, SONG Yibing, MA Chao, et al. Unsupervised deep tracking[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1308-1317.
[61] SUN Mingjie, XIAO Jimin, LIM E G, et al. Cycle-free weakly referring expression grounding with self-paced learning[J]. IEEE transactions on multimedia, 2021, 25: 1611-1621.
[62] ZHANG Zhu, ZHAO Zhou, LIN Zhijie, et al. Counterfactual contrastive learning for weakly-supervised vision-language grounding[J]. Advances in neural information processing systems, 2020, 33: 18123-18134.
[63] ZHAO Chenlin, YE Jiabo, SONG Yaguang, et al. Part-aware prompt tuning for weakly supervised referring expression grounding[C]//MultiMedia Modeling. Cham: Springer Nature Switzerland, 2024: 489-502.
[64] ZENG Yan, ZHANG Xinsong, LI Hang. Multi-grained vision language pre-training: aligning texts with visual concepts[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 25994-26009.
[65] JIA Menglin, TANG Luming, CHEN B C, et al. Visual prompt tuning[C]//Computer Vision–ECCV 2022. Cham: Springer Nature Switzerland, 2022: 709-727.
[66] ZHANG Panpan, LIU Meng, SONG Xuemeng, et al. Universal relocalizer for weakly supervised referring expression grounding[J]. ACM transactions on multimedia computing, communications, and applications, 2024, 20(7): 1-23.
[67] LIU Yang, ZHANG Jiahua, CHEN Qingchao, et al. Confidence-aware pseudo-label learning for weakly supervised visual grounding[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 2816-2826.
[68] LI Junnan, LI Dongxu, XIONG Caiming, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 12888-12900.
[69] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10674-10685.
[70] YU S, SEO P H, SON J. Zero-shot referring image segmentation with global-local context features[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 19456-19465.
[71] LAROCHELLE H, ERHAN D, BENGIO Y. Zero-data learning of new tasks[C]//Proceedings of the 23rd National Conference on Artificial Intelligence-Volume 2. Chicago: AAAI Press, 2008: 646-651.
[72] SUO Yucheng, ZHU Linchao, YANG Yi. Text augmented spatial aware zero-shot referring image segmentation[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: ACL, 2023: 1032-1043.
[73] LI J, LI D, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International Conference on Machine Learning. Honolulu: PMLR, 2023: 19730-19742.
[74] LI Changlong, ZHUANG Jiedong, HU Jiaqi, et al. Zero-shot referring image segmentation with hierarchical prompts and frequency domain fusion[M]//PRICAI 2024: Trends in Artificial Intelligence. Singapore: Springer Nature Singapore, 2024: 215-228.
[75] HUANG Xinyu, HUANG Yijie, ZHANG Youcai, et al. Open-set image tagging with multi-grained text supervision[EB/OL]. (2023-11-16) [2024-12-12]. https://arxiv.org/abs/2310.15200.
[76] KE L, YE M, DANELLJAN M, et al. Segment anything in high quality[J]. Advances in neural information processing systems, 2023, 36: 29914-29934.
[77] DA CUNHA A L, ZHOU Jianping, DO M N. The nonsubsampled contourlet transform: theory, design, and applications[J]. IEEE transactions on image processing, 2006, 15(10): 3089-3101.
[78] CHEN C F, HSIAO C H. Haar wavelet method for solving lumped and distributed-parameter systems[J]. IEE proceedings - control theory and applications, 1997, 144(1): 87-94.
[79] LI Wenhui, PANG Chao, NIE Weizhi, et al. Bidirectional mask selection for zero-shot referring image segmentation[J]. IEEE transactions on circuits and systems for video technology, 2024, 35(1): 911-921.
[80] YU S, SEO P H, SON J. Pseudo-RIS: distinctive pseudo-supervision generation for referring image segmentation[C]//Computer Vision–ECCV 2024. Cham: Springer Nature Switzerland, 2024: 18-36.
[81] YU Jiahui, WANG Zirui, VASUDEVAN V, et al. CoCa: contrastive captioners are image-text foundation models[EB/OL]. (2022-06-14)[2024-12-12]. https://arxiv.org/abs/2205.01917.
[82] NI Minheng, ZHANG Yabo, FENG Kailai, et al. Ref-diff: zero-shot referring image segmentation with generative models[EB/OL]. (2023-09-01)[2024-12-12]. https://arxiv.org/abs/2308.16777.
[83] SHI Hengcan, HAYAT M, CAI Jianfei. Unpaired referring expression grounding via bidirectional cross-modal matching[J]. Neurocomputing, 2023, 518: 39-49.
[84] SUBRAMANIAN S, MERRILL W, DARRELL T, et al. ReCLIP: a strong zero-shot baseline for referring expression comprehension[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin: ACL, 2022: 5198-5215.
[85] HONNIBAL M, JOHNSON M. An improved non-monotonic transition system for dependency parsing[C]//Conference on Empirical Methods in Natural Language Processing, EMNLP 2015. Lisboa: Association for Computational Linguistics (ACL), 2015: 1373-1378.
[86] HAN Zeyu, ZHU Fangrui, LAO Qianru, et al. Zero-shot referring expression comprehension via structural similarity between images and captions[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 14364-14375.
[87] WANG Hanyao, ZHAN Yibing, LIU Liu, et al. Towards alleviating text-to-image retrieval hallucination for CLIP in zero-shot learning[EB/OL]. (2024-06-27)[2024-12-12]. https://arxiv.org/abs/2402.18400.
[88] FLORIDI L, CHIRIATTI M. GPT-3: its nature, scope, limits, and consequences[J]. Minds and machines, 2020, 30(4): 681-694.
[89] QIU Heqian, WANG Lanxiao, ZHAO Taijin, et al. MCCE-REC: MLLM-driven cross-modal contrastive entropy model for zero-shot referring expression comprehension[J]. IEEE transactions on circuits and systems for video technology, 2025, 35(1): 754-768.
[90] JIANG Haojun, LIN Yuanze, HAN Dongchen, et al. Pseudo-Q: generating pseudo language queries for visual grounding[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 15492-15502.
[91] ANDERSON P, HE Xiaodong, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077-6086.
[92] WU Cantao, CAI Yi, LI Liuwu, et al. Scene graph enhanced pseudo-labeling for referring expression comprehension[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: ACL, 2023: 11978-11990.
[93] ZHANG Ao, YAO Yuan, CHEN Qianyu, et al. Fine-grained scene graph generation with data transfer[C]//Computer Vision–ECCV 2022. Cham: Springer Nature Switzerland, 2022: 409-424.
[94] LIN S, HILTON J, EVANS O. Teaching models to express their uncertainty in words[J/OL]. Transactions on Machine Learning Research. [2024-12-12]. https://openreview.net/forum?id=8s8K2UZGTZ.
[95] LIU Xuyang, HUANG Siteng, KANG Yachen, et al. VGDiffZero: text-to-image diffusion models can be zero-shot visual grounders[C]//ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul: IEEE, 2024: 2765-2769.
[96] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. Cham: Springer International Publishing, 2015: 234-241.
[97] NAGARAJA V K, MORARIU V I, DAVIS L S. Modeling context between objects for referring expression understanding[C]//Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016: 792-807.
[98] KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: referring to objects in photographs of natural scenes[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha: ACL, 2014: 787-798.
[99] GRUBINGER M, CLOUGH P D, MÜLLER H, et al. The IAPR TC-12 benchmark: a new evaluation resource for visual information systems[C]//International Workshop OntoImage. Genoa: LREC, 2006: 13-23.
[100] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Computer Vision–ECCV 2014. Cham: Springer International Publishing, 2014: 740-755.
[101] ZHAI Xiaohua, KOLESNIKOV A, HOULSBY N, et al. Scaling vision transformers[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1204-1213.
[102] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[C/OL]. International Conference on Learning Representations. (2021-06-03) [2024-12-12]. https://openreview.net/forum?id=YicbFdNTTy.
[103] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[104] CHENG Bowen, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1280-1289.

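Memo

Received: 2025-05-06.
Funding: National Natural Science Foundation of China (62476064); Natural Science Foundation of Guangdong Province (2024A1515010455).
About the authors: ZHANG Lei, professor, member of the Cognitive Systems and Information Processing Technical Committee of the Chinese Association for Artificial Intelligence. Main research interests: machine learning and computer vision. Has led 4 National Natural Science Foundation of China projects and published more than 30 academic papers in the past 5 years, including in top journals and conferences of the field. E-mail: zhanglei@gdupt.edu.cn. HUANG Yongqiu, master's student. Main research interest: computer vision. E-mail: smile_n_n@163.com. LI Xin, associate professor. Main research interests: signal and image processing, computer vision, and robotics. E-mail: lixin@gdupt.edu.cn.
Corresponding author: LI Xin. E-mail: lixin@gdupt.edu.cn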
