[1] WANG Wenbo, ZHANG Zhifei, WANG Ruizhi, et al. Retrieval-augmented generation based on cluster reorganization and pre-parsing[J]. CAAI Transactions on Intelligent Systems, 2026, 21(1): 236-244. [doi:10.11992/tis.202506029]
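CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
21
Issue:
No. 1, 2026
Pages:
236-244
Column:
Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date:
2026-03-05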
- Title:
-
Retrieval-augmented generation based on cluster reorganization and pre-parsing
- Author(s):
-
WANG Wenbo (王文博)1, ZHANG Zhifei (张志飞)2, WANG Ruizhi (王睿智)1, MIAO Duoqian (苗夺谦)1
-
1. School of Computer Science and Technology, Tongji University, Shanghai 201804, China;
2. Project Management Office of China National Scientific Seafloor Observatory, Tongji University, Shanghai 200092, China
-
- Keywords:
-
deep learning; natural language processing; large language models; vector retrieval; question answering; retrieval-augmented generation; clustering algorithms; prompt engineering
- CLC number:
-
TP311.1
- DOI:
-
10.11992/tis.202506029
- Abstract:
-
Retrieval-augmented generation (RAG) has attracted attention for its ability to supply large language models (LLMs) with knowledge from outside the model. However, most existing RAG methods struggle to capture both fine-grained local details and multi-hop knowledge that is non-contiguous in the original text. To address this, the study proposes a RAG method based on cluster reorganization and pre-parsing. In the indexing stage, a clustering algorithm first groups non-contiguous but related knowledge into new chunks, which improves retrieval of multi-hop knowledge; prompt engineering is then used to pre-parse each chunk into finer-grained new chunks, which improves recall at retrieval time. In the retrieval stage, all recalled new chunks are mapped back to their original text chunks and passed to the LLM together with the query to produce the final answer. The method is evaluated on the QuALITY dataset: ablation experiments and comparisons with open-source baselines confirm its effectiveness, and it achieves the best result on the public leaderboard. The analysis offers a reference for indexing and retrieval techniques in RAG.
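The indexing and retrieval flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bag-of-words `embed`, the greedy threshold clustering in `cluster_reorganize`, and the sentence-splitting `pre_parse` are all hypothetical stand-ins (assumptions) for the vector model, clustering algorithm, and LLM prompt that the paper uses, none of which are specified on this page. The sketch only shows how fine-grained index units can be mapped back to the original, possibly non-contiguous, chunks.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list and bag-of-words vectors replace a real embedding model.
STOPWORDS = {"the", "on", "in", "at", "a", "to", "of"}

def embed(text):
    """Toy embedding: term counts over lowercase words, stopwords removed."""
    return Counter(t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_reorganize(chunks, threshold=0.3):
    """Group related chunks, contiguous or not, into clusters of chunk ids.

    Assumption: greedy single-pass clustering; the paper's algorithm may differ.
    """
    clusters = []
    for i, chunk in enumerate(chunks):
        vec = embed(chunk)
        best, best_sim = None, threshold
        for ids in clusters:
            sim = cosine(vec, embed(" ".join(chunks[j] for j in ids)))
            if sim >= best_sim:
                best, best_sim = ids, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

def pre_parse(text):
    """Hypothetical stand-in for prompt-based pre-parsing: split into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_index(chunks):
    """Index fine-grained units, each remembering the original chunk ids behind it."""
    index = []
    for ids in cluster_reorganize(chunks):
        merged = " ".join(chunks[j] for j in ids)
        for unit in pre_parse(merged):
            index.append((unit, ids))
    return index

def retrieve(query, chunks, index, top_k=2):
    """Rank units by similarity, then restore the original chunks they came from."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, embed(e[0])), reverse=True)
    ids = []
    for _, src in ranked[:top_k]:
        for j in src:
            if j not in ids:
                ids.append(j)
    return [chunks[j] for j in sorted(ids)]

chunks = [
    "Mara repaired the broken radio on the ship.",
    "The harvest festival began at dawn in the village.",
    "The radio on the ship let Mara call the village.",
]
index = build_index(chunks)
query = "Who repaired the radio?"
context = retrieve(query, chunks, index, top_k=1)
# The restored chunks and the query would then be passed to an LLM (not shown).
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
print(prompt)
```

In this toy run, chunks 0 and 2 are not adjacent but land in the same cluster, so a single matching unit brings both original chunks back into the prompt.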
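Memo
Received: 2025-06-25.
Funding: National Key Research and Development Program of China (2022YFB3104702); Natural Science Foundation of Shanghai (22ZR1466700).
About the authors:
WANG Wenbo, master's student. Main research interests: deep learning and vector retrieval. E-mail: wang.wenbo.top@qq.com.
ZHANG Zhifei, Ph.D., doctoral supervisor, member of the Granular Computing and Knowledge Discovery Technical Committee of the Chinese Association for Artificial Intelligence, and secretary-general of the Computer Vision Technical Committee of the Shanghai Computer Society. Main research interests: pattern recognition and big data mining. Has led projects funded by the National Natural Science Foundation of China and the Natural Science Foundation of Shanghai, received the second prize of the Wu Wenjun Artificial Intelligence Natural Science Award, and published more than 30 academic papers. E-mail: zhifeizhang@tongji.edu.cn.
WANG Ruizhi, associate professor, doctoral supervisor, member of the Granular Computing and Knowledge Discovery Technical Committee of the Chinese Association for Artificial Intelligence. Main research interests: deep learning and granular computing. Received the second prize of the Wu Wenjun Artificial Intelligence Natural Science Award and has published more than 50 academic papers. E-mail: ruizhiwang@tongji.edu.cn.
Corresponding author: ZHANG Zhifei. E-mail: zhifeizhang@tongji.edu.cn
Last Update: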
2026-01-05