
[1]王文博,张志飞,王睿智,等.基于聚类重组和预解析的检索增强生成方法[J].智能系统学报,2026,21(1):236-244.[doi:10.11992/tis.202506029]
　WANG Wenbo,ZHANG Zhifei,WANG Ruizhi,et al.Retrieval-augmented generation based on cluster reorganization and pre-parsing[J].CAAI Transactions on Intelligent Systems,2026,21(1):236-244.[doi:10.11992/tis.202506029]


基于聚类重组和预解析的检索增强生成方法


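CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785/CN 23-1538/TP] Volume: 21; Issue: 2026, No. 1; Pages: 236-244; Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum; Publication date: 2026-01-05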

Title: Retrieval-augmented generation based on cluster reorganization and pre-parsing

作者: 王文博¹, 张志飞², 王睿智¹, 苗夺谦¹
1. 同济大学计算机科学与技术学院, 上海 201804;
2. 同济大学国家海底科学观测系统项目办公室, 上海 200092

Author(s): WANG Wenbo¹, ZHANG Zhifei², WANG Ruizhi¹, MIAO Duoqian¹
1. School of Computer Science and Technology, Tongji University, Shanghai 201804, China;
2. Project Management Office of China National Scientific Seafloor Observatory, Tongji University, Shanghai 200092, China

关键词: 深度学习; 自然语言处理; 大语言模型; 向量检索; 自动问答; 检索增强生成; 聚类算法; 提示工程

Keywords: deep learning; natural language processing; large language models; vector retrieval; question answering; retrieval-augmented generation; clustering algorithms; prompt engineering

Classification number: TP311.1

DOI: 10.11992/tis.202506029

摘要: 检索增强生成(retrieval-augmented generation, RAG)技术因具有为大语言模型(large language model, LLM)提供模型外知识的能力而受到人们的关注，然而绝大多数方法都难以同时兼顾局部的细节知识和原文中不连续的多跳知识。针对上述问题，提出基于聚类重组和预解析的检索增强生成方法。在索引阶段，首先通过聚类算法将不连续的相关知识组合成新分块，以提高多跳知识的检索能力；然后基于提示工程对各知识分块进行预解析生成更细粒度的新分块，以提高检索阶段的召回率。在检索阶段，将召回的所有新分块还原为原文分块，并连同查询语句输入给大语言模型以得到最终答案。在数据集QuALITY上对所提出的方法进行了评估，通过消融实验和开源基线对比实验验证了方法的有效性，并在公开的评测排行榜上取得了最佳效果。本文分析结果可为RAG的索引和检索技术提供参考。

Abstract: Retrieval-augmented generation (RAG) has attracted attention for its ability to supply large language models (LLMs) with knowledge from outside the model. However, most existing RAG methods struggle to capture both fine-grained local details and multi-hop knowledge that is scattered non-contiguously through the source text. To address this, the study proposes a RAG method based on cluster reorganization and pre-parsing. In the indexing stage, a clustering algorithm first groups non-contiguous but related knowledge into new chunks to improve the retrieval of multi-hop knowledge; prompt engineering is then used to pre-parse each chunk into finer-grained new chunks to improve recall at retrieval time. In the retrieval stage, all recalled new chunks are mapped back to their original text chunks, which are passed to the LLM together with the query to produce the final answer. The method is evaluated on the QuALITY dataset: ablation experiments and comparisons with open-source baselines support its effectiveness, and it achieved the best result on the public evaluation leaderboard. The analysis may serve as a reference for indexing and retrieval techniques in RAG.
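The index-then-restore flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`embed`, `cluster_chunks`, `pre_parse`, `build_index`, `retrieve`, `build_prompt`) is hypothetical, and three simplifications are assumed: a bag-of-words vector stands in for a real embedding model, a greedy similarity threshold stands in for the paper's clustering algorithm, and sentence splitting stands in for LLM-based pre-parsing.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_chunks(chunks, threshold=0.4):
    """Greedy clustering (assumption: a stand-in for the paper's clustering step).
    A chunk joins the first cluster whose seed chunk is similar enough,
    otherwise it starts a new cluster. Clusters hold original chunk ids."""
    clusters = []
    for i, chunk in enumerate(chunks):
        for cluster in clusters:
            if cosine(embed(chunk), embed(chunks[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def pre_parse(text):
    """Sentence splitting (assumption: stands in for prompt-based pre-parsing)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_index(chunks, threshold=0.4):
    """Indexing stage: reorganize chunks by cluster, split each reorganized
    chunk into finer units, and remember which originals each unit came from."""
    index = []
    for cluster in cluster_chunks(chunks, threshold):
        merged = " ".join(chunks[i] for i in cluster)
        for unit in pre_parse(merged):
            index.append((unit, tuple(cluster)))
    return index


def retrieve(query, chunks, index, top_k=1):
    """Retrieval stage: rank fine-grained units, then restore the recalled
    units to their original chunks (deduplicated, in document order)."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, embed(item[0])), reverse=True)
    ids = set()
    for _, source_ids in ranked[:top_k]:
        ids.update(source_ids)
    return [chunks[i] for i in sorted(ids)]


def build_prompt(query, restored_chunks):
    """Assemble the text that would be sent to an LLM (no model is called here)."""
    context = "\n".join(restored_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "Mara hid the brass key under the lighthouse stairs.",
        "The harbor market sells fish every morning.",
        "The brass key opens the vault in the lighthouse cellar.",
    ]
    idx = build_index(docs)
    question = "What opens the vault?"
    print(build_prompt(question, retrieve(question, docs, idx)))
```

In this toy example the first and third chunks are not adjacent, yet they land in the same cluster, so a query that matches only the third chunk also restores the first. That is the multi-hop effect the abstract attributes to cluster reorganization, shown here only under the simplifying assumptions listed above.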



备注/Memo

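Received: 2025-06-25.
Funding: National Key Research and Development Program of China (2022YFB3104702); Natural Science Foundation of Shanghai (22ZR1466700).
About the authors:
WANG Wenbo, master's student. Main research interests: deep learning and vector retrieval. E-mail: wang.wenbo.top@qq.com.
ZHANG Zhifei, Ph.D., doctoral supervisor; member of the Granular Computing and Knowledge Discovery Technical Committee of the Chinese Association for Artificial Intelligence; secretary-general of the Computer Vision Technical Committee of the Shanghai Computer Society. Main research interests: pattern recognition and big data mining. Has led projects funded by the National Natural Science Foundation of China and the Natural Science Foundation of Shanghai; recipient of a second prize of the Wu Wenjun Artificial Intelligence Natural Science Award; author of more than 30 academic papers. E-mail: zhifeizhang@tongji.edu.cn.
WANG Ruizhi, associate professor, doctoral supervisor; member of the Granular Computing and Knowledge Discovery Technical Committee of the Chinese Association for Artificial Intelligence. Main research interests: deep learning and granular computing. Recipient of a second prize of the Wu Wenjun Artificial Intelligence Natural Science Award; author of more than 50 academic papers. E-mail: ruizhiwang@tongji.edu.cn.
Corresponding author: ZHANG Zhifei. E-mail: zhifeizhang@tongji.edu.cn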

更新日期/Last Update: 2026-01-05
