[1] LIU Yi-qun, ZHANG Min, MA Shao-ping. Web key resource page selection based on non-content information [J]. CAAI Transactions on Intelligent Systems, 2007, 2(01): 45-52.

Web key resource page selection based on non-content information

CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
2
Issue:
No. 01, 2007
Pages:
45-52
Column:
Publication date:
2007-02-25

Article Info

Title:
Web key resource page selection based on non-content information
Article ID:
1673-4785(2007)01-0045-08
Author(s):
LIU Yi-qun (刘奕群), ZHANG Min (张敏), MA Shao-ping (马少平)
State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China
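Keywords (translated from the Chinese):
web information retrieval; key resource page; topic distillation; machine learning
Keywords:
web information retrieval; key resource page; topic distillation; link structure analysis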
CLC number:
TP181,TP391.3
Document code:
A
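Abstract (translated from the Chinese):
The explosive growth of Web information means that any current search engine can index only a small fraction of the data on the Web, and that fraction is itself filled with a large amount of low-quality information. Finding high-quality key resources on the Web independently of user queries is a challenge for Web information retrieval. Statistics over a large collection of Web pages show that several non-content features of pages can be used to locate key resource pages, and that combining these features with decision tree learning yields query-independent key resource page selection. Experiments on the standard Text REtrieval Conference (TREC) evaluation platform, over more than 19 GB of text data, show that this method covers more than 70% of the Web's key information using about 20% of the pages. Retrieval on a key resource set containing only 24% of all pages outperforms retrieval on the whole page set by more than 60%. This indicates that it is entirely possible to obtain higher retrieval performance with a smaller index.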
Abstract:
The growth of information makes it impossible for search engines to crawl and index all pages on the Web. Meanwhile, the indexed page set is filled with low-quality information and spam. Selecting high-quality Web pages (key resource pages) query-independently is therefore a considerable challenge. Based on an analysis of the non-content features of key resources, a pre-selection method is introduced for topic distillation research. A decision tree is constructed to locate key resource pages using query-independent non-content features, including in-degree, document length, URL type, and two newly proposed features based on analysis of a site's self-link structure. Although the resulting page set contains only about 20% of the pages in the whole collection, it covers more than 70% of the key resources. Furthermore, information retrieval on this page set achieves an improvement of more than 60% over retrieval on all pages. This shows an effective way to obtain better topic distillation performance with a smaller data set.
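
As a rough illustration of the pre-selection idea in the abstract, the sketch below filters pages with a small hand-written rule tree over query-independent features and then reports what fraction of pages was kept and what fraction of key resources was covered. Everything concrete here is an assumption made for illustration: the `Page` fields, the URL type labels, the thresholds, and the toy data are hypothetical, the tree in the paper is learned from data rather than written by hand, and the two self-link structure features are omitted because the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    in_degree: int    # number of in-links pointing to the page
    doc_length: int   # document length in words
    url_type: str     # hypothetical labels: "root", "subroot", "path", "file"
    is_key: bool      # toy ground-truth label, used only for evaluation


def is_key_resource(page: Page) -> bool:
    """Hand-written stand-in for a learned decision tree.

    All thresholds are made up for this example.
    """
    if page.url_type in ("root", "subroot"):
        # Entry-like URLs need only modest link support.
        return page.in_degree >= 5
    if page.in_degree >= 50:
        # Deep pages need many in-links and non-trivial content.
        return page.doc_length >= 200
    return False


def preselect(pages):
    """Keep only the pages the rule tree accepts (no query involved)."""
    return [p for p in pages if is_key_resource(p)]


def summarize(pages, selected):
    """Return (fraction of pages kept, fraction of key resources covered)."""
    total_key = sum(p.is_key for p in pages)
    covered = sum(p.is_key for p in selected)
    kept = len(selected) / len(pages)
    coverage = covered / total_key if total_key else 0.0
    return kept, coverage


# Toy collection; URLs and values are invented.
PAGES = [
    Page("http://a.example/", 120, 300, "root", True),
    Page("http://a.example/docs/", 8, 150, "subroot", True),
    Page("http://a.example/docs/x.html", 2, 900, "file", False),
    Page("http://b.example/", 1, 50, "root", False),
    Page("http://b.example/news/2003/item.html", 75, 400, "file", True),
    Page("http://b.example/news/2003/old.html", 60, 40, "file", False),
    Page("http://c.example/tmp/p.html", 0, 20, "file", False),
    Page("http://c.example/about/", 3, 100, "subroot", True),
    Page("http://c.example/about/staff.html", 4, 500, "file", False),
    Page("http://d.example/x/y/z.html", 1, 10, "file", False),
]

if __name__ == "__main__":
    selected = preselect(PAGES)
    kept, coverage = summarize(PAGES, selected)
    print(f"kept {kept:.0%} of pages, covering {coverage:.0%} of key resources")
```

Because the filter never looks at a query, it can run once at indexing time, and retrieval can then run over the smaller selected set.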

Memo
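Received: 2006-04-23.
Funding:
National Basic Research Program of China (973 Program) (2004CB318108);
National Natural Science Foundation of China (60223004, 60321002, 60303005, 60503064);
Key Project of Science and Technology Research of the Ministry of Education (104236).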
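About the authors:
LIU Yi-qun, male, born in 1981, Ph.D. candidate. His main research interests are information retrieval, machine learning, and Web user behavior analysis. He has published more than 10 academic papers.
 E-mail: liuyiqun03@mails.tsinghua.edu.cn.
ZHANG Min, female, born in 1977, assistant researcher. Her main research interests are information retrieval, machine learning, natural language processing, cognition-based information processing, and the extraction and analysis of user behavior patterns in the Web environment, as well as related Web information acquisition techniques. She has published more than 40 academic papers.
MA Shao-ping, male, born in 1961, professor and doctoral supervisor. His main research interests are knowledge engineering, information retrieval, Chinese character recognition and post-processing, and the digitization of ancient Chinese books. He has undertaken a number of projects funded by the National Natural Science Foundation of China, the "863" High-Tech Program, and the "973" Program, as well as international cooperation projects. His work on offline handwritten Chinese character recognition and post-processing has reached an internationally advanced level. The "Offline Handwritten Chinese Character and Digit Recognition System" won the Second Prize for Science and Technology Progress of the State Education Commission in January 1998. He has published more than 60 papers and 2 textbooks.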
Last Update: 2009-05-05