[1] YAN Yuanting, WU Yaya, ZHAO Shu, et al. Improving missing data recovery with a constructive covering algorithm[J]. CAAI Transactions on Intelligent Systems, 2019, 14(06): 1225-1232. [doi:10.11992/tis.201906015]

Improving missing data recovery with a constructive covering algorithm

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 14
Issue:
2019, No. 06
Pages:
1225-1232
Publication date:
2019-11-05

Article Info

Title:
Improving missing data recovery with a constructive covering algorithm
Author(s):
YAN Yuanting, WU Yaya, ZHAO Shu, ZHANG Yanping
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Keywords:
incomplete data; missing value imputation; neighborhood information; data mining; machine learning; imputation method; single imputation; multiple imputation
CLC number:
TP18
DOI:
10.11992/tis.201906015
Abstract:
Incomplete data processing is an important problem in data mining, machine learning, and related fields, and missing value imputation is the mainstream way to handle incomplete data. Most existing imputation methods apply techniques from statistics and machine learning to analyze the information remaining in the original data and derive plausible values to replace the missing entries. Imputation methods fall roughly into single imputation and multiple imputation, each with its own advantages in different scenarios. However, few methods go on to consider the neighborhood information in the spatial distribution of the samples and use it to correct the imputed values. This paper therefore proposes a framework that can be applied to many existing imputation methods to improve their imputation results. The framework consists of three parts: pre-filling, spatial neighborhood information mining, and corrective filling of the pre-filled results. Seven imputation methods were evaluated on eight UCI datasets, and the experimental results verify the effectiveness and robustness of the proposed framework.
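
The three-stage structure named in the abstract (pre-filling, neighborhood information mining, corrective filling) can be illustrated with a minimal sketch. This is not the authors' algorithm: the paper mines neighborhoods with a constructive covering algorithm, whose details are not given on this page, so the sketch swaps in a plain k-nearest-neighbor search as a stand-in. The function names, the mean pre-fill, and the blend weight `alpha` are all hypothetical choices made for illustration.

```python
import math

def prefill_mean(rows):
    """Stage 1 (pre-filling): replace each None with the mean of the
    observed values in its column. Any existing imputation method could
    be used here; column means are just the simplest choice."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def neighbors(filled, i, k):
    """Stage 2 (neighborhood mining): indices of the k rows closest to
    row i in the pre-filled space. ASSUMPTION: k-NN with Euclidean
    distance stands in for the paper's constructive covering step."""
    dists = [(math.dist(filled[i], row), idx)
             for idx, row in enumerate(filled) if idx != i]
    return [idx for _, idx in sorted(dists)[:k]]

def correct(rows, filled, k=2, alpha=0.5):
    """Stage 3 (corrective filling): blend each pre-filled value with the
    mean of the same attribute over the row's neighbors.
    `alpha` is a hypothetical blend weight, not a parameter from the paper."""
    result = [list(r) for r in filled]
    for i, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue  # observed rows are left untouched
        nbrs = neighbors(filled, i, k)
        for j in missing:
            nbr_mean = sum(filled[n][j] for n in nbrs) / len(nbrs)
            result[i][j] = (1 - alpha) * filled[i][j] + alpha * nbr_mean
    return result

# Toy data: row 1 is missing its second attribute; row 3 is an outlier.
data = [
    [1.0, 2.0],
    [1.1, None],
    [0.9, 2.2],
    [5.0, 8.0],
]
filled = prefill_mean(data)
corrected = correct(data, filled, k=2, alpha=1.0)
```

On this toy data the mean pre-fill is pulled toward the outlier row (about 4.07), while the corrected value with `alpha=1.0` is about 2.1, the average over the two nearby rows. The example only shows why neighborhood information can help; it says nothing about the results reported in the paper.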


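Memo:
Received: 2019-06-06.
Funding: National Natural Science Foundation of China (61806002, 61872002, 61673020, 61876001, 61602003); Natural Science Foundation of Anhui Province (1708085QF143, 1808085MF197); Doctoral Research Start-up Fund of Anhui University (J01003253).
About the authors: YAN Yuanting, male, born in 1986, lecturer, Ph.D., member of the Chinese Association for Artificial Intelligence (CAAI). His main research interests are machine learning, granular computing, and bioinformatics. He has led one National Natural Science Foundation of China Young Scientists project and has published more than 10 academic papers. WU Yaya, male, born in 1995, master's student, CAAI member. His main research interests are machine learning and incomplete data processing. ZHAO Shu, female, born in 1979, professor, doctoral supervisor, Ph.D., member of the CAAI Technical Committee on Granular Computing and Knowledge Discovery and executive director of the Anhui Association for Artificial Intelligence. Her main research interests are machine learning and granular computing. She holds multiple invention patents and software copyrights and has published more than 60 academic papers.
Corresponding author: ZHANG Yanping. E-mail: zhangyp2@gmail.com
Last Update: 2019-12-25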