XIE Juanying, ZHOU Ying, WANG Mingzhao, et al. New criteria for evaluating the validity of clustering[J]. CAAI Transactions on Intelligent Systems, 2017, (06): 873-882. [doi:10.11992/tis.201706029]





New criteria for evaluating the validity of clustering
XIE Juanying (谢娟英), ZHOU Ying (周颖), WANG Mingzhao (王明钊), JIANG Weiliang (姜炜亮)
School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
clustering; validity of clustering; evaluation index; external criteria; internal criteria; F-measure; Adjusted Rand Index; STDI; S2; PS2
Criteria for evaluating the clustering ability of a clustering algorithm are of two kinds, internal and external. Current external evaluation indexes fail to account for skewed clustering results, and the validity results produced by current internal evaluation indexes make it difficult to determine the optimal number of clusters. To address these defects, we propose two external evaluation indexes that consider both positive and negative information. One is based on the contingency table and the other on sample pairs, and both can evaluate clustering results from a dataset with arbitrary distribution. We also propose using variance to measure the tightness of a cluster and the separability between clusters, and we use the ratio of these two quantities as an internal evaluation index. Experiments on datasets from the UCI (University of California, Irvine) machine learning repository and on artificially simulated datasets show that the proposed internal index can effectively find the true number of clusters in a dataset. The proposed external indexes based on the contingency table and sample pairs are effective and can be used to evaluate clustering results from skewed and noisy data.
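
The sample-pair view of external evaluation mentioned in the abstract can be illustrated with the standard pair-counting quantities. The following is a minimal sketch and not the paper's proposed indexes, whose formulas are not given on this page. It builds the contingency table between reference classes and predicted clusters, counts the four kinds of sample pairs, and computes the Adjusted Rand Index listed in the keywords. The function names `pair_counts` and `adjusted_rand_index` are illustrative. Reading "positive" information as pairs placed in the same cluster and "negative" information as pairs placed in different clusters is an assumption.

```python
from collections import Counter
from math import comb


def pair_counts(truth, pred):
    """Count sample pairs by agreement between reference classes and clusters.

    Returns (ss, sd, ds, dd):
      ss: same class, same cluster
      sd: same class, different cluster
      ds: different class, same cluster
      dd: different class, different cluster
    """
    if len(truth) != len(pred):
        raise ValueError("label sequences must have equal length")
    n = len(truth)
    cells = Counter(zip(truth, pred))  # contingency table entries n_ij
    rows = Counter(truth)              # class sizes (row sums)
    cols = Counter(pred)               # cluster sizes (column sums)

    same_both = sum(comb(v, 2) for v in cells.values())
    same_class = sum(comb(v, 2) for v in rows.values())
    same_cluster = sum(comb(v, 2) for v in cols.values())

    ss = same_both
    sd = same_class - same_both
    ds = same_cluster - same_both
    dd = comb(n, 2) - ss - sd - ds
    return ss, sd, ds, dd


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from the pair counts above."""
    ss, sd, ds, dd = pair_counts(truth, pred)
    total = ss + sd + ds + dd
    if total == 0:  # fewer than two samples: nothing to disagree on
        return 1.0
    same_class = ss + sd
    same_cluster = ss + ds
    expected = same_class * same_cluster / total
    max_index = 0.5 * (same_class + same_cluster)
    if max_index == expected:  # degenerate partitions, e.g. one cluster each
        return 1.0
    return (ss - expected) / (max_index - expected)
```

For example, `pair_counts([0, 0, 1, 1], [0, 1, 0, 1])` splits both reference classes across both clusters, so no pair agrees on "same class, same cluster". Because the counts depend only on the contingency table, relabeling the clusters does not change the result.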






Last Update: 2018-01-03