[1] HE Ming, CHANG Mengmeng, LIU Guoyang, et al. Log mining and application based on SQL-on-Hadoop query engine[J]. CAAI Transactions on Intelligent Systems, 2017, 12(05): 717-728. [doi:10.11992/tis.201706016]

Log mining and application based on SQL-on-Hadoop query engine

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
12
Issue:
2017, No. 05
Pages:
717-728
Section:
Publication date:
2017-10-25

Article Info

Title:
Log mining and application based on SQL-on-Hadoop query engine
Author(s):
HE Ming¹, CHANG Mengmeng¹, LIU Guoyang², GU Chengxiang², PENG Jike²
1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China;
2. Information Technology Management Department, Haitong Securities Co., Ltd., Shanghai 200001, China
Keywords:
big data; log analysis; data mining; Hadoop; query engine; data collection; indexed storage; securities business
CLC number:
TP391
DOI:
10.11992/tis.201706016
Abstract:
With the rapid development of computing and networking technologies and the growing variety of data acquisition methods, demand for real-time processing of massive data is rising, and traditional log analysis techniques hit a computational bottleneck at this scale. In the big data era, the development of open processing platforms has produced big data processing systems that can handle large-scale, diverse data. To let existing business applications take full advantage of Hadoop, we first study network log analysis methods based on big data technology and build a network log analysis platform that supports collection, parsing, storage, and efficient, flexible querying and computation over trillions of log entries. We then compare three representative SQL-on-Hadoop query systems, Hive, Impala, and Spark SQL, and present the performance characteristics of this class of system. We evaluate their decision support capabilities with the TPC-H benchmark and draw several useful conclusions from an analysis of the experimental data. Finally, we implement several typical applications of massive log data computation and analysis in the securities industry, which lays a foundation for further research.

Memo

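Received: 2017-06-07.
Funding: National Natural Science Foundation of China (91646201, 91546111, 60803086); sub-project of the National Key Technology R&D Program (2013BAH21B02-01); Beijing Natural Science Foundation (4153058, 4113076); Key Project of the Beijing Municipal Education Commission (KZ20160005009); General Project of the Beijing Municipal Education Commission (KM201710005023).
About the authors: HE Ming, male, born in 1975, Ph.D.; his main research interests are big data, recommender systems, and machine learning. CHANG Mengmeng, male, born in 1987, master's student; his main research interests are data mining and machine learning. LIU Guoyang, male, born in 1986, master's student; his main research interests are big data and data mining.
Corresponding author: HE Ming. E-mail: heming@bjut.edu.cn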
Last Update: 2017-10-25