[1]张磊,黄咏秋,李欣,等.弱监督下语言引导的图像分割与定位综述[J].智能系统学报,2025,20(6):1304-1327.[doi:10.11992/tis.202505001]
ZHANG Lei,HUANG Yongqiu,LI Xin,et al.Review of weakly supervised language-guided image segmentation and grounding[J].CAAI Transactions on Intelligent Systems,2025,20(6):1304-1327.[doi:10.11992/tis.202505001]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
20
Issue:
No. 6, 2025
Pages:
1304-1327
Section:
Review
Publication date:
2025-11-05
- Title:
-
Review of weakly supervised language-guided image segmentation and grounding
- Author(s) (Chinese names):
-
张磊¹, 黄咏秋², 李欣², 王宝艳²
-
1. 广东石油化工学院 电子信息工程学院, 广东 茂名 525000;
2. 广东石油化工学院 计算机学院, 广东 茂名 525000
- Author(s):
-
ZHANG Lei¹, HUANG Yongqiu², LI Xin², WANG Baoyan²
-
1. School of Electronic Information Engineering, Guangdong University of Petrochemical Technology, Maoming 525000, China;
2. School of Computer Science, Guangdong University of Petrochemical Technology, Maoming 525000, China
-
- Keywords:
-
deep learning; computer vision; weakly supervised learning; unsupervised learning; referring image segmentation; referring expression grounding; multimodal; large language model
- Classification number (CLC):
-
TP391
- DOI:
-
10.11992/tis.202505001
- Abstract:
-
Language-guided image segmentation (referring image segmentation, RIS) and grounding (referring expression grounding, REG) aim to predict the mask or bounding box of a target object from a natural language instruction, and are key tasks in vision-language understanding. Fully supervised methods are limited by high annotation costs, which has made weakly supervised learning a research focus. This paper reviews advances in weakly supervised RIS and REG from a unified perspective, focusing on methods that rely only on image-text pairs or on unlabeled data, and discusses open problems and future directions. It introduces the background of the RIS and REG tasks and analyzes the value and challenges of weak supervision. It summarizes the different types of weak supervision signals, surveys representative methods by category, and analyzes their characteristics. It introduces mainstream datasets and evaluation metrics and compares the performance of typical methods. The review shows that incorporating pretrained models such as multimodal large language models can significantly improve performance, although results remain constrained by the limitations of the pretrained models and by how well they adapt to the task. Optimizing fine-grained cross-modal alignment, model efficiency, and generalization ability will be important future research directions.
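Memo
Received: 2025-05-06.
Funding: National Natural Science Foundation of China (62476064); Natural Science Foundation of Guangdong Province (2024A1515010455).
Author biographies: ZHANG Lei, professor, member of the Cognitive Systems and Information Processing Technical Committee of the Chinese Association for Artificial Intelligence (CAAI); main research interests: machine learning and computer vision. Principal investigator of four National Natural Science Foundation of China projects; author of more than 30 academic papers in the past five years, including papers in top journals and conferences in the field. E-mail: zhanglei@gdupt.edu.cn. HUANG Yongqiu, master's student; main research interest: computer vision. E-mail: smile_n_n@163.com. LI Xin, associate professor; main research interests: signal and image processing, computer vision, and robotics. E-mail: lixin@gdupt.edu.cn.
Corresponding author: LI Xin. E-mail: lixin@gdupt.edu.cn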