
[1] YANG Zhiyong, JIANG Feng, YU Xu, et al. Mixed data clustering initialization method using outlier detection technology[J]. CAAI Transactions on Intelligent Systems, 2023, 18(1): 56-65. [doi:10.11992/tis.202203031]


Mixed data clustering initialization method using outlier detection technology


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 18 Issue: 2023, No. 1 Pages: 56-65 Column: Academic Papers: Machine Perception and Pattern Recognition Publication date: 2023-01-05

Title:: Mixed data clustering initialization method using outlier detection technology

Author(s):: YANG Zhiyong; JIANG Feng; YU Xu; DU Junwei (School of Information Science & Technology, Qingdao University of Science and Technology, Qingdao 266100, China)

Keywords:: initialization of clustering; mixed-type data; outlier detection; neighborhood rough set; granular neighborhood entropy; distance outlier factor; weighted density; weighted distance

CLC:: TP391

DOI:: 10.11992/tis.202203031

Abstract:: Clustering of mixed-type data has received wide attention in recent years. The K-prototype clustering algorithm is an effective method for such data, but it usually initializes cluster centers by random selection, which makes it difficult to guarantee the quality of the clustering results in many practical applications. To solve this problem, we select the initial centers for the K-prototype algorithm using outlier detection and present a new initialization algorithm for mixed-type data clustering: Initialization of K-prototype Clustering Based on Outlier Detection and Density (IKP-ODD). Given a candidate object, IKP-ODD decides whether it becomes an initial center by calculating its distance outlier factor, its weighted density, and its weighted distances from the existing initial centers. The distance outlier factor and the weighted density prevent outliers from being selected as initial centers. When calculating the weighted densities of objects and the weighted distances between objects, we use the granular neighborhood entropy from neighborhood rough sets to calculate the significance of each attribute and weight each attribute according to its significance, which reflects the differences between attributes. Experiments on several UCI datasets show that IKP-ODD outperforms existing initialization methods for K-prototype clustering.
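
The selection idea described in the abstract can be illustrated with a small sketch. This is not the IKP-ODD algorithm itself: the abstract does not give its formulas, so the code below makes several assumptions. The attribute weights are supplied by hand instead of being derived from granular neighborhood entropy, the outlier score is a plain mean distance standing in for the distance outlier factor, and the density is a simple neighbor fraction within a fixed radius. All function names and parameters (`weighted_distance`, `select_initial_centers`, `radius`, `outlier_share`) are hypothetical.

```python
from typing import List, Sequence, Tuple

# One object = (numeric attribute values, categorical attribute values).
Obj = Tuple[Sequence[float], Sequence[str]]


def weighted_distance(a: Obj, b: Obj,
                      w_num: Sequence[float], w_cat: Sequence[float]) -> float:
    """Weighted mixed-type distance: squared difference on numeric
    attributes plus a 0/1 mismatch on categorical attributes."""
    num = sum(w * (x - y) ** 2 for w, x, y in zip(w_num, a[0], b[0]))
    cat = sum(w for w, x, y in zip(w_cat, a[1], b[1]) if x != y)
    return num + cat


def select_initial_centers(data: List[Obj], k: int,
                           w_num: Sequence[float], w_cat: Sequence[float],
                           radius: float, outlier_share: float = 0.2) -> List[int]:
    """Return indices of k initial centers (assumed scheme, not IKP-ODD)."""
    n = len(data)
    dist = [[weighted_distance(data[i], data[j], w_num, w_cat)
             for j in range(n)] for i in range(n)]
    # Assumed density: fraction of objects (including itself) within `radius`.
    density = [sum(1 for j in range(n) if dist[i][j] <= radius) / n
               for i in range(n)]
    # Assumed outlier score: mean distance to all other objects.
    score = [sum(row) / (n - 1) for row in dist]
    # Exclude the highest-scoring share of objects from candidacy.
    n_drop = int(outlier_share * n)
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    candidates = sorted(set(range(n)) - set(ranked[:n_drop]))
    # First center: the densest candidate.
    centers = [max(candidates, key=lambda i: density[i])]
    # Later centers: dense candidates that lie far from the chosen centers.
    while len(centers) < k:
        remaining = [i for i in candidates if i not in centers]
        centers.append(max(
            remaining,
            key=lambda i: density[i] * min(dist[i][c] for c in centers)))
    return centers


# Two tight groups plus one isolated object (index 6).
example: List[Obj] = [
    ((0.0, 0.0), ("a",)), ((0.1, 0.0), ("a",)), ((0.0, 0.1), ("a",)),
    ((1.0, 1.0), ("b",)), ((0.9, 1.0), ("b",)), ((1.0, 0.9), ("b",)),
    ((5.0, 5.0), ("c",)),
]
centers = select_initial_centers(example, k=2, w_num=[1.0, 1.0],
                                 w_cat=[1.0], radius=0.5)
print(centers)  # prints [0, 3]
```

In this toy data the isolated object is excluded before any center is chosen, and the two selected centers fall in different groups. With purely random initialization the isolated object could be picked as a center.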


