
[1] LIU Yi-qun, ZHANG Min, MA Shao-ping. Web key resource page selection based on non-content information [J]. CAAI Transactions on Intelligent Systems, 2007, 2(1): 45-52.


Web key resource page selection based on non-content information


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 2 Issue: 1 (2007) Pages: 45-52 Column: Academic Papers (Intelligent Systems) Publication date: 2007-02-25

Title:: Web key resource page selection based on non-content information

Author(s):: LIU Yi-qun; ZHANG Min; MA Shao-ping; State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China

Keywords:: web information retrieval; key resource page; topic distillation; link structure analysis

CLC:: TP181,TP391.3

DOI:: -

Abstract:: Information growth makes it impossible for search engines to crawl and index every page on the Web. Meanwhile, the indexed page set is filled with low-quality information and spam. Selecting high-quality Web pages (key resource pages) query-independently is therefore quite a challenge. Based on an analysis of the non-content features of key resources, a pre-selection method was introduced into topic distillation research. A decision tree was constructed to locate key resource pages using query-independent non-content features, including in-degree, document length, URL type, and two newly proposed features derived from analysis of a site's self-link structure. Although the resulting page set contained only about 20% of the pages in the whole collection, it covered more than 70% of the key resources. Furthermore, information retrieval on this page set improved by more than 60% relative to retrieval on all pages. This shows an effective way to achieve better topic distillation performance with a smaller data set.
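The core idea in the abstract, choosing decision-tree splits over query-independent, non-content features, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy pages, the feature values, and the use of `url_depth` as a stand-in for "URL type" are all assumptions made for illustration, and the two self-link features are omitted because the abstract does not define them. The sketch finds the single threshold split with the highest information gain, which is the step a decision-tree learner repeats at each node.

```python
import math
from collections import Counter

# Hypothetical toy pages (not data from the paper):
# (in_degree, doc_length_in_words, url_depth, is_key_resource)
PAGES = [
    (120, 900, 0, True),
    (45, 400, 2, True),
    (60, 1500, 1, True),
    (30, 300, 0, True),
    (3, 200, 3, False),
    (1, 5000, 4, False),
    (0, 150, 2, False),
    (8, 700, 3, False),
]
FEATURES = ["in_degree", "doc_length", "url_depth"]  # assumed feature names


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_split(rows):
    """Return (gain, feature_index, threshold) for the binary split
    'feature <= threshold' with the highest information gain."""
    labels = [r[-1] for r in rows]
    base = entropy(labels)
    best = (0.0, None, None)
    for f in range(len(FEATURES)):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(rows)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, f, t)
    return best


def coverage(rows, feature_index, threshold):
    """Fraction of the collection kept, and fraction of key resources
    covered, when keeping only rows with feature > threshold."""
    kept = [r for r in rows if r[feature_index] > threshold]
    all_key = sum(1 for r in rows if r[-1])
    kept_key = sum(1 for r in kept if r[-1])
    return len(kept) / len(rows), kept_key / all_key


if __name__ == "__main__":
    gain, f, t = best_split(PAGES)
    kept_fraction, key_coverage = coverage(PAGES, f, t)
    print(f"split on {FEATURES[f]} > {t}: gain={gain:.3f}")
    print(f"kept {kept_fraction:.0%} of pages, covering {key_coverage:.0%} of key resources")
```

The `coverage` helper mirrors the two figures the abstract reports (share of the collection retained versus share of key resources covered). The numbers it produces for this toy data say nothing about the paper's results.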



Last Update: 2009-05-05
