WANG Chunkai, ZHUANG Fuzhen, SHI Zhongzhi. System resource allocation for variable data streams[J]. CAAI Transactions on Intelligent Systems, 2019, 14(6): 1278-1285. [doi:10.11992/tis.201908011]





System resource allocation for variable data streams
WANG Chunkai (王春凯)1,2, ZHUANG Fuzhen (庄福振)2, SHI Zhongzhi (史忠植)2
1. Post-doctoral Research Center, China Reinsurance (Group) Corporation, Beijing 100033, China;
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
large-scale data stream management system; variable data stream; incremental learning; model prediction; parameter configuration; mini-batch processing; system performance; outlier detection
A large-scale data stream management system (LSDSMS) usually contains a relational query system (RQS) and a stream processing system (SPS). When users submit queries to the RQS, system parameters often need to be set according to the rate and distribution of the data streams. However, because data streams are variable, changing the resource allocation often reduces the performance of the LSDSMS. To address this problem, we propose OrientStream+, a framework for automating characterization deployment in the LSDSMS. First, based on a user-defined query latency threshold, we design a data stream transmission mechanism for a mini-batch scheme. Then, we introduce a multi-level pipeline cache that processes batches of data streams under the same configuration and obtains accurate query results using the timestamps of the data streams. We also propose an incremental learning technique with outlier detection to improve the prediction accuracy of OrientStream+. Finally, we validate the proposed approach on the open-source SPS Storm. Our experimental results show that OrientStream+ can reduce processing latency and improve the throughput of the LSDSMS.
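
As a rough illustration of the mini-batch idea mentioned in the abstract, the following Python sketch buffers timestamped tuples and releases a batch once a latency budget or a size limit is reached. It is not the OrientStream+ implementation: the class `MiniBatcher`, its parameters `latency_threshold` and `max_batch_size`, and the flush rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

Record = Tuple[float, Any]  # (timestamp in seconds, payload)


@dataclass
class MiniBatcher:
    """Hypothetical mini-batch buffer, for illustration only.

    A batch is released when it holds `max_batch_size` tuples or when
    the oldest buffered tuple has waited at least `latency_threshold`
    seconds, whichever happens first.
    """
    latency_threshold: float
    max_batch_size: int
    _buffer: List[Record] = field(default_factory=list, init=False, repr=False)

    def add(self, timestamp: float, item: Any) -> Optional[List[Record]]:
        """Buffer one tuple; return a batch if a limit is hit, else None."""
        self._buffer.append((timestamp, item))
        oldest = min(ts for ts, _ in self._buffer)
        size_hit = len(self._buffer) >= self.max_batch_size
        latency_hit = timestamp - oldest >= self.latency_threshold
        if size_hit or latency_hit:
            return self.flush()
        return None

    def flush(self) -> List[Record]:
        """Release the buffered tuples in timestamp order and reset the buffer."""
        batch = sorted(self._buffer, key=lambda record: record[0])
        self._buffer = []
        return batch
```

A caller would feed each arriving tuple to `add` and forward any returned batch downstream. Sorting by timestamp on release is one simple way to keep results consistent with event order, under the assumption that each tuple carries its own timestamp.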




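About the authors: WANG Chunkai, male, born in 1981, is a postdoctoral researcher. His main research interests are data stream management and knowledge fusion. He has led and participated in projects funded by the China Postdoctoral Science Foundation, the National Key Research and Development Program of China, and the National Natural Science Foundation of China, as well as other industry-sponsored projects. He has published more than 10 academic papers; ZHUANG Fuzhen, male, born in 1983, is an associate researcher. His main research interests are transfer learning, data mining, and machine learning. He has led and participated in projects of the National Key Research and Development Program of China, the National "863" Program, sub-projects of the "973" Program, and the National Natural Science Foundation of China, as well as other industry-sponsored projects. He has published more than 40 academic papers; SHI Zhongzhi, male, born in 1941, is a researcher. His main research interests include intelligence science, artificial intelligence, machine learning, and knowledge engineering. He received the Second Prize of the Science and Technology Progress Award of the Chinese Academy of Sciences in 1979, 1998, and 2001, the Special Prize of the same award in 1994, and the Second Prize of the National Science and Technology Progress Award in 2002. He has published more than 400 academic papers and 5 monographs.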
Last Update: 2019-12-25