
[1] HE Ming, CHANG Mengmeng, LIU Guoyang, et al. Log mining and application based on SQL-on-Hadoop query engine[J]. CAAI Transactions on Intelligent Systems, 2017, 12(5):717-728. [doi:10.11992/tis.201706016]


Log mining and application based on SQL-on-Hadoop query engine


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 12 Issue: 5 (2017) Page numbers: 717-728 Column: Academic Papers, Intelligent Systems Publication date: 2017-10-25

Title: Log mining and application based on SQL-on-Hadoop query engine

Author(s): HE Ming¹; CHANG Mengmeng¹; LIU Guoyang²; GU Chengxiang²; PENG Jike²
1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China;
2. Information Technology Management Department, Haitong Securities Co., Ltd., Shanghai 200001, China

Keywords: big data; log analysis; data mining; Hadoop; query engine; data collection; indexed storage; securities business

CLC: TP391

DOI: 10.11992/tis.201706016

Abstract: With the rapid development of computing and networking technologies and the growing number of data acquisition methods, demand for real-time processing of massive log data increases daily, and traditional log analysis technology hits a computational bottleneck at this scale. As open processing platforms have developed in the big data era, a number of big data processing systems have emerged for handling large-scale, diverse data. To apply the advantages of Hadoop effectively to existing business systems, this study first investigates network log analysis methods based on big data technology and constructs a network log analysis platform that supports the acquisition, analysis, storage, efficient and flexible querying, and computation of trillions of log entries. We then compare and analyze three representative SQL-on-Hadoop query systems (Hive, Impala, and Spark SQL) and identify the performance characteristics of this type of system, using the TPC-H benchmark to test and assess their decision-support capabilities. We draw several useful conclusions from the analysis of the experimental data. Finally, we suggest a few typical applications of this massive-log analysis and processing system in the securities field, which provides a solid foundation for further research.



Last Update: 2017-10-25
