[1] WANG Wenbo, ZHANG Zhifei, WANG Ruizhi, et al. Retrieval-augmented generation based on cluster reorganization and pre-parsing[J]. CAAI Transactions on Intelligent Systems, 2026, 21(1): 236-244. [doi:10.11992/tis.202506029]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume:
21
Issue:
2026, No. 1
Pages:
236-244
Column:
Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date:
2026-03-05
- Title:
-
Retrieval-augmented generation based on cluster reorganization and pre-parsing
- Author(s):
-
WANG Wenbo¹; ZHANG Zhifei²; WANG Ruizhi¹; MIAO Duoqian¹
-
1. School of Computer Science and Technology, Tongji University, Shanghai 201804, China;
2. Project Management Office of China National Scientific Seafloor Observatory, Tongji University, Shanghai 200092, China
-
- Keywords:
-
deep learning; natural language processing; large language models; vector retrieval; question answering; retrieval-augmented generation; clustering algorithms; prompt engineering
- CLC:
-
TP311.1
- DOI:
-
10.11992/tis.202506029
- Abstract:
-
Retrieval-augmented generation (RAG) has attracted considerable attention for its ability to supply external knowledge to large language models (LLMs). However, existing RAG methods often struggle to capture both local, fine-grained knowledge and non-contiguous multi-hop knowledge in the original text at the same time. To address this issue, this study proposes a RAG method based on cluster reorganization and pre-parsing. In the indexing stage, clustering algorithms group non-contiguous but related knowledge into new chunks, which improves the retrieval of multi-hop information. Prompt engineering is then applied to pre-parse these chunks, dividing them into finer-grained sub-units to improve recall during retrieval. In the retrieval stage, all retrieved chunks are restored to their original context blocks and, together with the query, are fed into the LLM to generate the final answer. Ablation and comparative experiments on the QuALITY dataset demonstrate the effectiveness of the proposed method, which achieves the best performance on the public leaderboard. These findings offer insights for improving indexing and retrieval techniques in RAG.
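The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the embedding model, or the pre-parsing prompts, so every function below is a hypothetical stand-in. Here `embed` is a toy bag-of-words vector, `cluster_passages` is a greedy threshold clustering, and plain sentence splitting replaces the LLM-based pre-parsing.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_passages(passages, threshold=0.3):
    """Greedy clustering (assumption): a passage joins the first cluster whose
    seed passage is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of passage indices
    for i, passage in enumerate(passages):
        for cluster in clusters:
            if cosine(embed(passage), embed(passages[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def build_index(passages, threshold=0.3):
    """Indexing stage: reorganize passages into clustered chunks, then split
    each chunk into sub-units that remember their parent chunk."""
    clusters = cluster_passages(passages, threshold)
    chunks = [" ".join(passages[i] for i in cluster) for cluster in clusters]
    units = []  # (sub-unit text, parent chunk id)
    for chunk_id, chunk in enumerate(chunks):
        # Sentence splitting stands in for prompt-based pre-parsing.
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            if sentence:
                units.append((sentence, chunk_id))
    return chunks, units


def retrieve(query, chunks, units, top_k=1):
    """Retrieval stage: rank sub-units, then return their parent chunks
    (deduplicated, in rank order) as the context for the LLM."""
    q = embed(query)
    ranked = sorted(units, key=lambda u: cosine(q, embed(u[0])), reverse=True)
    parent_ids = []
    for _, chunk_id in ranked[:top_k]:
        if chunk_id not in parent_ids:
            parent_ids.append(chunk_id)
    return [chunks[chunk_id] for chunk_id in parent_ids]


# Two related sentences are separated by an unrelated one in the source text.
passages = [
    "Alice founded the observatory in 1990.",
    "The harbor market sells fresh fish daily.",
    "The observatory Alice founded later moved to Shanghai.",
]
chunks, units = build_index(passages)
context = retrieve("When was the Shanghai observatory founded?", chunks, units)
print(context)
```

In this toy run the best-matching sub-unit is the sentence mentioning Shanghai, which does not contain the founding year; restoring it to its clustered parent chunk brings the sentence with "1990" into the context. A real system would pass `context` together with the query to an LLM.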