
[1]杨潇,马军,杨同峰,等.主题模型LDA的多文档自动文摘[J].智能系统学报,2010,5(2):169-176.
　YANG Xiao,MA Jun,YANG Tong-feng,et al.Automatic multidocument summarization based onthe latent Dirichlet topic allocation model[J].CAAI Transactions on Intelligent Systems,2010,5(2):169-176.


Multidocument automatic summarization based on the LDA topic model


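CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 5 Issue: No. 2, 2010 Pages: 169-176 Column: Academic Papers, Natural Language Processing and Understanding Publication date: 2010-04-25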

Title:: Automatic multidocument summarization based on the latent Dirichlet topic allocation model

Article ID:: 1673-4785(2010)02-0169-08

Author(s):: YANG Xiao¹, MA Jun², YANG Tong-feng², DU Yan-qi², SHAO Hai-min²; 1. School of Information Management, Shandong Economic University, Ji'nan 250014, China;
2. School of Computer Science and Technology, Shandong University, Ji'nan 250101, China

Keywords:: multidocument summarization; sentence scoring; topic model; latent Dirichlet allocation (LDA); number of topics

CLC number:: TP391

Document code:: A

Abstract:: Representing the multidocument summarization problem with probabilistic topic models has attracted growing attention in recent years. Latent Dirichlet allocation (LDA) is one of the representative probabilistic generative topic models. This paper proposes an LDA-based summarization method. The number of topics in the LDA model is determined by model perplexity, and the topic distribution of each sentence and the word distribution of each topic are obtained by Gibbs sampling. The importance of each topic is determined by summing its weight over all sentences. Two sentence-scoring models are proposed, one based on the topic distribution and the other on the sentence distribution in the LDA model. Evaluated with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics on the DUC2002 generic multidocument summarization test set, both proposed methods outperform the state-of-the-art SumBasic system on all ROUGE scores and also compare favorably with two other LDA-based summarization systems.
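The pipeline described in the abstract can be illustrated with a minimal Python sketch. It assumes the sentence-topic matrix `theta` and the topic-word matrix `phi` have already been estimated (for example by Gibbs sampling, which is not shown here). The toy values, the function names, and in particular the scoring formula in `score_sentences` are illustrative assumptions: the abstract does not give the two scoring formulas from the paper, so the version below is only one plausible reading of "topic importance as the sum of topic weights over sentences".

```python
import math

def perplexity(sentences, theta, phi):
    """Perplexity of word-id sequences given sentence-topic weights (theta)
    and topic-word probabilities (phi). Lower is better."""
    log_lik, n_words = 0.0, 0
    for s, words in enumerate(sentences):
        for w in words:
            # Probability of word w in sentence s, marginalized over topics.
            p = sum(theta[s][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_words += 1
    return math.exp(-log_lik / n_words)

def topic_importance(theta):
    """Sum each topic's weight over all sentences, then normalize to sum to 1."""
    n_topics = len(theta[0])
    totals = [sum(row[k] for row in theta) for k in range(n_topics)]
    z = sum(totals)
    return [t / z for t in totals]

def score_sentences(theta, importance):
    """Hypothetical scoring rule (not taken from the paper): weight each
    sentence's topic distribution by the importance of the topics."""
    return [sum(t * imp for t, imp in zip(row, importance)) for row in theta]

# Toy data: 3 sentences, 2 topics, vocabulary of 3 word ids (all assumed).
sentences = [[0, 0, 1], [0, 1], [2, 2, 1]]
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]   # sentence-topic weights
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]       # topic-word probabilities

ppl = perplexity(sentences, theta, phi)
importance = topic_importance(theta)
scores = score_sentences(theta, importance)
# Sentence indices ordered from highest to lowest score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ppl, importance, ranking)
```

In practice one would fit LDA for several candidate topic counts, compute perplexity for each (ideally on held-out text), and keep the count with the lowest value before scoring and extracting the top-ranked sentences.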


Memo

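Received: 2010-01-05.
Funding:
National Natural Science Foundation of China (60970047);
Natural Science Foundation of Shandong Province (Y2008G19);
Science and Technology Program of Shandong Province (2007GG10001002, 2008GG10001026).
Corresponding author: YANG Xiao. E-mail: yangx@mail.sdu.edu.cn.
Author biographies:
YANG Xiao, female, born in 1981, Ph.D. Her main research interest is natural language processing. She has published more than 10 academic papers.
MA Jun, male, born in 1956, professor and doctoral supervisor. His main research interests are algorithm analysis and design, information retrieval, and parallel computing. He has led 2 National "863" Program projects, 1 National Natural Science Foundation project, 2 Ministry of Education fund projects, and several provincial fund projects, and has published more than 80 academic papers.
YANG Tong-feng, male, born in 1985, Ph.D. candidate. His main research interests are personalized retrieval and image annotation.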

Last Update: 2010-05-24
