[1] WANG Zhanyi, XU Weiran, GUO Jun. New technologies of intelligent text search[J]. CAAI Transactions on Intelligent Systems, 2012, 7(01): 40-49.

New technologies of intelligent text search

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
7
Issue:
2012, No. 01
Pages:
40-49
Column:
Publication date:
2012-02-25

Article Info

Title:
New technologies of intelligent text search
Article ID:
1673-4785(2012)01-0040-10
Author(s):
WANG Zhanyi (1, 2), XU Weiran (1, 2), GUO Jun (1, 2)
1. Pattern Recognition and Intelligent System (PRIS) Laboratory, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Keywords:
intelligent text search; text retrieval; text analysis
CLC number:
TP393
Document code:
A
Abstract:
To cope with the massive amount of information on the Internet and the demand for accurate, efficient, and personalized search, a set of new intelligent text search technologies covering information retrieval, information extraction, and information filtering is proposed. First, new information retrieval technologies are introduced through the subtasks of enterprise retrieval, entity retrieval, blog retrieval, and relevance feedback. Second, the entity linking and slot filling subtasks related to information extraction are described, followed by the spam email filtering subtask related to information filtering. These key technologies have been integrated and applied in several well-known international evaluations, such as the Text REtrieval Conference (TREC) and the Text Analysis Conference (TAC) held in the USA, and have been validated in practical systems for Internet public opinion analysis, short message public opinion analysis, and the campus object search engine (COSE).


Memo:
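Received: 2011-01-02.
Published online: 2012-02-18.
Funding: National Natural Science Foundation of China (60905017); Programme of Introducing Talents of Discipline to Universities (B08004).
Corresponding author: WANG Zhanyi. E-mail: wangzhanyi@gmail.com.
About the authors:
WANG Zhanyi, male, born in 1984, is a Ph.D. candidate. His main research interests include information filtering and information retrieval. He has published 10 papers in major domestic and international journals and conferences and holds 2 invention patents.
XU Weiran, male, born in 1975, is an associate professor. His main research interests are information retrieval, pattern recognition, and machine learning. He has led participation in well-known international evaluations such as TREC, TAC, and ACE with excellent results, has taken part in several national research projects, and has published more than 20 papers.
GUO Jun, male, born in 1959, is a professor and doctoral supervisor. His main research interests include pattern recognition, network management, information retrieval, and content-based information security. He has led several National "863" Program projects and National Natural Science Foundation of China projects, has received multiple provincial and ministerial awards, has published more than one hundred papers, and holds 5 granted patents.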
Last Update: 2012-05-07