[1] ZHANG Sen, ZHANG Chen, LIN Peiguang, et al. Web search topic analysis based on user search query logs[J]. CAAI Transactions on Intelligent Systems, 2017, 12(05): 668-677. [doi:10.11992/tis.201706096]

Web search topic analysis based on user search query logs

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 12
Issue:
2017, No. 05
Pages:
668-677
Publication date:
2017-10-25

Article Info

Title:
Web search topic analysis based on user search query logs
Author(s):
ZHANG Sen¹, ZHANG Chen¹˒², LIN Peiguang¹, ZHANG Chunyun¹, GUO Yuchao¹, REN Weilong¹, REN Ke²
1. School of Computer Science & Technology, Shandong University of Finance & Economics, Jinan 250014, China;
2. Department of Computer Science & Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China
Keywords:
web search; search engine; natural language processing; topic model; data mining; burstiness; temporal analysis; parameter estimation
CLC number:
TP391
DOI:
10.11992/tis.201706096
Abstract:
Web search analysis plays a critical role in improving the performance of search engines, and analyzing the search behavior of individual users can further improve search accuracy. Most existing models, such as the click graph and its variants, focus on the characteristics shared by a group of users. However, few studies address how to capture both the collective characteristics of the group and the unique characteristics of individual users. In this paper, we investigate user-specific web search analysis: by studying the burstiness of user searches, we obtain the topic distributions of individual users' search queries. We propose two topic models, the search burstiness model (SBM) and the coupling-sensitive search burstiness model (CS-SBM). The SBM assumes that query words and URLs are topically independent, whereas the CS-SBM assumes that they are topically related. The obtained topic distribution information is stored in skewed Dirichlet priors, and a Beta distribution is used to capture the temporal properties of user searches. Experimental results show that each user's web search trail has its own user-specific characteristics. On a large volume of real user query log data, the proposed models show a clear advantage in generalization performance over latent Dirichlet allocation (LDA), DCMLDA, and topics over time (TOT), and they effectively describe how the topics of user search queries change over time.
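The abstract states that a Beta distribution captures the temporal properties of user searches. The sketch below is not the SBM or CS-SBM inference procedure, which this page does not describe. It only illustrates, under stated assumptions, how a Beta distribution can summarize when queries on one topic occur within a log window. It assumes timestamps have already been rescaled to the open interval (0, 1) and uses method-of-moments fitting; the function names `fit_beta_moments` and `beta_mode` are hypothetical.

```python
from statistics import fmean, pvariance


def fit_beta_moments(timestamps):
    """Fit Beta(a, b) to timestamps in (0, 1) by matching mean and variance.

    Assumption: timestamps are already normalized to the open interval (0, 1).
    """
    mean = fmean(timestamps)
    var = pvariance(timestamps, mu=mean)
    # Moment matching is only valid when 0 < var < mean * (1 - mean).
    if not 0 < var < mean * (1 - mean):
        raise ValueError("variance outside the range a Beta distribution allows")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def beta_mode(a, b):
    """Return the peak location of Beta(a, b); defined only for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)


# Hypothetical normalized timestamps of one user's queries on a single topic,
# concentrated late in the observation window.
late_burst = [0.70, 0.75, 0.80, 0.85, 0.90]
a, b = fit_beta_moments(late_burst)
peak = beta_mode(a, b)
```

A fitted distribution with `a` larger than `b` places most of its mass late in the window, which is one simple way to read a late burst of interest in a topic.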



Memo:
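Received: 2017-07-01.
Funding: Key Program of the National Natural Science Foundation of China (U1201258); Shandong Provincial Natural Science Foundation for Distinguished Young Scholars (JQ201316); Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042).
Author biographies: ZHANG Sen, male, born in 1992, master's student. His main research interests are information retrieval and natural language processing. ZHANG Chen, male, born in 1988, associate professor, Ph.D. His main research interests are crowdsourcing, data analysis and data mining, and machine learning. He has published more than 10 papers in major journals and top academic conferences in China and abroad, including TKD, VLDB, SIGMOD, and ICDE. LIN Peiguang, male, born in 1978, associate professor, Ph.D. His main research interests are information retrieval and massive data processing and integration. He has led 2 Ministry of Education projects, 1 Shandong Provincial Natural Science Foundation project, 1 Independent Innovation Program project of the Jinan Science and Technology Bureau, and 1 Young Science and Technology Star Program project, and has participated in several National Natural Science Foundation of China and provincial or ministerial projects. He has published more than 30 academic papers, of which 3 are indexed by SCI and more than 30 by EI.
Corresponding author: ZHANG Chen. E-mail: zhangchen.sdufe@gmail.com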
Last Update: 2017-10-25