[1] LIU Yi-qun, ZHANG Min, MA Shao-ping. Web key resource page selection based on non-content information[J]. CAAI Transactions on Intelligent Systems, 2007, 2(1): 45-52.
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 2
Issue: 2007, No. 1
Pages: 45-52
Column: Academic Papers: Intelligent Systems
Publication date: 2007-02-25
- Title: Web key resource page selection based on non-content information
- Author(s): LIU Yi-qun; ZHANG Min; MA Shao-ping
  State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China
- Keywords: web information retrieval; key resource page; topic distillation; link structure analysis
- CLC: TP181, TP391.3
- DOI:
- Abstract:
Information growth makes it impossible for search engines to crawl and index all pages on the Web. Meanwhile, the indexed page set is filled with low-quality information and spam. It is quite a challenge to select high-quality Web pages (key resource pages) query-independently. Based on an analysis of the non-content features of key resources, a pre-selection method was introduced into topic distillation research. A decision tree was constructed to locate key resource pages using query-independent non-content features, including in-degree, document length, URL type, and two newly proposed features involving analysis of a site's self-link structure. Although the resulting page set contained only about 20% of the pages in the whole collection, it covered more than 70% of the key resources. Furthermore, information retrieval on this page set achieved an improvement of more than 60% over retrieval on all pages. This shows an effective way to get better performance in topic distillation with a smaller data set.
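
The pre-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' decision tree: the tree below is hand-written rather than learned, the thresholds and the URL type labels (`root`, `subroot`, `path`, `file`) are assumptions chosen for illustration, and the two self-link features are omitted because the abstract does not define them. The sketch only shows how a query-independent filter yields the two quantities the abstract reports: the share of the collection retained and the share of key resources covered.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    in_degree: int         # number of in-links pointing to the page
    doc_length: int        # document length in words
    url_type: str          # assumed labels: "root", "subroot", "path", "file"
    is_key_resource: bool  # ground-truth label, used only for evaluation


def is_candidate(page: Page) -> bool:
    """Hand-written stand-in for a learned decision tree (hypothetical thresholds)."""
    if page.url_type in ("root", "subroot"):
        return True
    if page.in_degree >= 10:
        return page.doc_length >= 200
    return False


def evaluate(pages: list[Page]) -> tuple[float, float]:
    """Return (fraction of collection retained, fraction of key resources covered)."""
    selected = [p for p in pages if is_candidate(p)]
    total_key = sum(p.is_key_resource for p in pages)
    covered = sum(p.is_key_resource for p in selected)
    size_ratio = len(selected) / len(pages)
    coverage = covered / total_key if total_key else 0.0
    return size_ratio, coverage


# Toy collection; all values are made up for illustration.
pages = [
    Page("http://a.example/", 50, 300, "root", True),
    Page("http://a.example/x/y.html", 2, 1000, "file", False),
    Page("http://b.example/docs/guide.html", 20, 500, "file", True),
    Page("http://b.example/docs/stub.html", 15, 50, "file", False),
    Page("http://c.example/topic/", 0, 400, "path", True),
]

size_ratio, coverage = evaluate(pages)
print(f"retained {size_ratio:.0%} of pages, covering {coverage:.0%} of key resources")
```

Because the filter uses no query terms, it could be applied once at indexing time, and retrieval would then run on the smaller selected set.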