[1] ZHANG Lei, HUANG Yongqiu, LI Xin, et al. Review of weakly supervised language-guided image segmentation and grounding[J]. CAAI Transactions on Intelligent Systems, 2025, 20(6): 1304-1327. [doi:10.11992/tis.202505001]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 6
Pages: 1304-1327
Column: Review
Publication date: 2025-11-05
- Title: Review of weakly supervised language-guided image segmentation and grounding
- Author(s): ZHANG Lei1; HUANG Yongqiu2; LI Xin2; WANG Baoyan2
- Affiliations:
  1. School of Electronic Information Engineering, Guangdong University of Petrochemical Technology, Maoming 525000, China;
  2. School of Computer Science, Guangdong University of Petrochemical Technology, Maoming 525000, China
- Keywords: deep learning; computer vision; weakly supervised learning; unsupervised learning; referring image segmentation; referring expression grounding; multimodal; large language model
- CLC: TP391
- DOI: 10.11992/tis.202505001
- Abstract: Language-guided image segmentation (referring image segmentation, RIS) and grounding (referring expression grounding, REG) aim to predict masks or bounding boxes for target objects specified by natural language expressions, and are key tasks in vision-language understanding. Because fully supervised methods are constrained by high annotation costs, weakly supervised learning has attracted growing interest. This paper reviews recent advances in weakly supervised RIS and REG from a unified perspective, focusing on methods based on image-text pairs and on unlabeled data, and discusses current challenges and future directions. It first introduces the background of RIS and REG and analyzes the value and challenges of weak supervision. It then summarizes the different types of weak supervision signals, categorizes representative methods, and analyzes their characteristics. Finally, it presents mainstream datasets and evaluation metrics and compares the performance of typical methods. Studies show that incorporating pretrained models, such as large language models, can significantly improve performance, although limitations arising from the constraints of pretrained models and from task adaptation remain. Optimizing fine-grained cross-modal alignment, model efficiency, and generalization ability will be important directions for future research.