[1] ZHAO Wenqing, HOU Xiaoke. News topic recognition of Chinese microblog based on word cooccurrence graph[J]. CAAI Transactions on Intelligent Systems, 2012, 7(05): 444-449.

News topic recognition of Chinese microblog based on word cooccurrence graph

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 7
Issue:
2012, No. 05
Pages:
444-449
Column:
Publication date:
2012-10-25

Article Info

Title:
News topic recognition of Chinese microblog based on word cooccurrence graph
Article ID:
1673-4785(2012)05-0444-06
Author(s):
ZHAO Wenqing, HOU Xiaoke
School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, Hebei, China
Keywords:
microblog; news topics; topic recognition; keywords; word cooccurrence graph
CLC number:
TP391.1
Document code:
A
Abstract:
Traditional topic detection algorithms are designed mainly for long texts such as news web pages and blogs, and cannot effectively handle sparse microblog data. This paper presents a method based on the word cooccurrence graph for recognizing news topics in microblogs. First, after the microblog data are preprocessed, keywords are extracted by combining two factors: the relative word frequency and the word frequency increase rate. Second, a word cooccurrence graph is built from the cooccurrence degrees between keywords; each disconnected cluster in the graph is treated as a news topic, and the few keywords that carry the most information in each cluster are used to represent that topic. Finally, experiments on a microblog data set show that the method recognizes news topics in microblogs, verifying its effectiveness.
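
The two steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the scoring rule (a weighted sum of relative frequency and a smoothed increase rate), the weight `alpha`, the raw-count threshold `min_cooccurrence` used in place of the paper's cooccurrence degree, and the toy posts are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_keywords(current_docs, previous_docs, top_k=5, alpha=0.5):
    """Rank words by relative frequency plus frequency increase rate.

    The weighted-sum formula and add-one smoothing are assumptions,
    not the formulas used in the paper.
    """
    # Count each word at most once per post (document frequency).
    cur = Counter(w for doc in current_docs for w in set(doc))
    prev = Counter(w for doc in previous_docs for w in set(doc))
    total = sum(cur.values()) or 1
    scores = {}
    for word, count in cur.items():
        relative_freq = count / total
        # Add-one smoothing avoids division by zero for unseen words.
        increase_rate = (count - prev[word]) / (prev[word] + 1)
        scores[word] = alpha * relative_freq + (1 - alpha) * increase_rate
    # Sort by descending score, then alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def build_cooccurrence_graph(docs, keywords, min_cooccurrence=2):
    """Link two keywords when they share at least min_cooccurrence posts."""
    keyword_set = set(keywords)
    pair_counts = Counter()
    for doc in docs:
        present = sorted(keyword_set.intersection(doc))
        pair_counts.update(combinations(present, 2))
    graph = {w: set() for w in keywords}
    for (a, b), n in pair_counts.items():
        if n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return each disconnected cluster of the graph as a set of keywords."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy, made-up posts (already tokenized); not data from the paper.
toy_posts = [
    ["earthquake", "rescue", "relief"],
    ["earthquake", "rescue"],
    ["earthquake", "relief", "rescue"],
    ["concert", "tickets"],
    ["concert", "tickets", "stadium"],
]
toy_keywords = ["earthquake", "rescue", "relief", "concert", "tickets"]

toy_graph = build_cooccurrence_graph(toy_posts, toy_keywords, min_cooccurrence=2)
topics = connected_components(toy_graph)
print([sorted(t) for t in topics])
# [['earthquake', 'relief', 'rescue'], ['concert', 'tickets']]
```

Each printed cluster corresponds to one candidate news topic. Choosing the few most informative keywords within a cluster, as the abstract describes, would need an additional ranking step that is not shown here.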

Memo

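Received: 2012-05-26.
Published online: 2012-09-17.
Funding: National Natural Science Foundation of China (70671039); Fundamental Research Funds for the Central Universities (12MS121).
Corresponding author: HOU Xiaoke.
E-mail: houxiaoke2008@163.com.
About the authors:
ZHAO Wenqing, female, born in 1973, associate professor, member of the Rough Set and Soft Computing Society of the Chinese Association for Artificial Intelligence. Her main research interests include machine learning, data mining, and Bayesian network learning. She has received one third-class Hebei Province Science and Technology Progress Award and holds one national invention patent. She has published more than 30 academic papers and 3 textbooks.
HOU Xiaoke, male, born in 1985, master's student. His main research interests are artificial intelligence and data mining.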
Last Update: 2012-11-13