[1] ZHANG Sen, ZHANG Chen, LIN Peiguang, et al. Web search topic analysis based on user search query logs [J]. CAAI Transactions on Intelligent Systems, 2017, 12(5): 668-677. [doi:10.11992/tis.201706096]
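CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
12
Issue:
No. 5, 2017
Pages:
668-677
Column:
Academic Papers: Natural Language Processing and Understanding
Publication date:
2017-10-25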
- Title:
-
Web search topic analysis based on user search query logs
- Author(s):
-
ZHANG Sen1, ZHANG Chen1,2, LIN Peiguang1, ZHANG Chunyun1, GUO Yuchao1, REN Weilong1, REN Ke2
-
1. School of Computer Science & Technology, Shandong University of Finance & Economics, Jinan 250014, China;
2. Department of Computer Science & Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China
-
- Keywords:
-
web search; search engine; natural language processing; topic model; text mining; burstiness; temporal analysis; parameter estimation
- CLC number:
-
TP391
- DOI:
-
10.11992/tis.201706096
- Abstract:
-
Web search analysis plays a critical role in improving the performance of search engines, and analyzing the search behavior of individual users can improve search accuracy. Most existing models, such as the click graph and its variants, focus on the characteristics shared by a group of users. However, little work has examined models that capture both the collective characteristics of the group and the unique characteristics of individual users. In this paper, we investigate user-specific web search analysis: we obtain the topic distributions of an individual user's search queries by studying the burstiness of that user's searches. We propose two search topic models, the search burstiness model (SBM) and the coupling-sensitive search burstiness model (CS-SBM). The SBM assumes that query words and URLs are topically independent, whereas the CS-SBM assumes that they are topically related. The learned topic distribution information is stored in skewed Dirichlet priors, and a beta distribution is used to capture the temporal properties of user searches. Experimental results show that each user's web search trail has multiple user-specific characteristics. On a large set of real user query logs, the proposed models generalize better than latent Dirichlet allocation (LDA), DCMLDA, and topics over time (TOT), and they effectively describe how the topics of user search queries change over time.
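The abstract does not say how the beta parameters are estimated. Purely as an illustration, the sketch below fits a beta distribution to one topic's query timestamps by the method of moments, assuming the timestamps have already been normalized into the open interval (0, 1). The function names and the sample timestamps are hypothetical and are not taken from the paper.

```python
import math
from statistics import fmean, pvariance


def fit_beta_moments(times):
    """Method-of-moments estimate of Beta(a, b) from values in (0, 1).

    Assumption: `times` are query timestamps for a single topic, already
    rescaled so that the log's time span maps onto (0, 1).
    """
    if len(times) < 2:
        raise ValueError("need at least two timestamps")
    if not all(0.0 < t < 1.0 for t in times):
        raise ValueError("timestamps must lie strictly inside (0, 1)")
    mean = fmean(times)
    var = pvariance(times, mu=mean)
    # A beta distribution requires 0 < var < mean * (1 - mean).
    if var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("sample variance is incompatible with a beta fit")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t in (0, 1), computed in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_density = log_norm + (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t)
    return math.exp(log_density)


# Hypothetical bursty topic: all queries cluster late in the observed period.
sample_times = [0.75, 0.78, 0.80, 0.82, 0.85]
a_hat, b_hat = fit_beta_moments(sample_times)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```

A tightly clustered set of timestamps yields large parameters and a sharply peaked density, which is one simple way to picture a search burst; widely spread timestamps yield a flatter density.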
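Memo
Received: 2017-07-01.
Funding: Key Program of the National Natural Science Foundation of China (U1201258); Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042).
Author biographies: ZHANG Sen, male, born in 1992, master's student. His main research interests are information retrieval and natural language processing. ZHANG Chen, male, born in 1988, associate professor, Ph.D. His main research interests are crowdsourcing, data analysis and data mining, and machine learning. He has published more than 10 papers in major journals and top academic conferences at home and abroad, including TKD, VLDB, SIGMOD, and ICDE. LIN Peiguang, male, born in 1978, associate professor, Ph.D. His main research interests are information retrieval and massive data processing and integration. He has led 2 Ministry of Education projects, 1 Shandong Provincial Natural Science Foundation project, 1 Independent Innovation Program project of the Jinan Science and Technology Bureau, and 1 Young Science and Technology Star Program project, and has participated in several National Natural Science Foundation and provincial- and ministerial-level projects. He has published more than 30 academic papers, of which 3 are indexed by SCI and more than 30 by EI.
Corresponding author: ZHANG Chen. E-mail: zhangchen.sdufe@gmail.com
Last Update: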
2017-10-25