[1] XIE Juanying, ZHOU Ying, WANG Mingzhao, et al. New criteria for evaluating the validity of clustering[J]. CAAI Transactions on Intelligent Systems, 2017, 12(06): 873-882. [doi:10.11992/tis.201706029]

New criteria for evaluating the validity of clustering

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 12
Issue:
2017, No. 06
Pages:
873-882
Section:
Publication date:
2017-12-25

Article Info

Title:
New criteria for evaluating the validity of clustering
Author(s):
XIE Juanying, ZHOU Ying, WANG Mingzhao, JIANG Weiliang
School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
Keywords:
clustering; validity of clustering; evaluation index; external criteria; internal criteria; F-measure; Adjusted Rand Index; STDI; S2; PS2
CLC number:
TP108
DOI:
10.11992/tis.201706029
Abstract:
Criteria for evaluating clustering validity fall into two classes: external and internal. Existing external indexes do not account for class skew in clustering results, and existing internal indexes often fail to identify the optimal number of clusters. To address these defects, we propose two external evaluation indexes that use both positive and negative class information, one based on the contingency table and one based on sample pairs, for evaluating clustering results on datasets of arbitrary distribution. We also propose an internal evaluation index that uses variance to measure both within-cluster compactness and between-cluster separation, and takes the ratio of between-cluster separation to within-cluster compactness as its value. Experiments on datasets from the UCI (University of California, Irvine) machine learning repository and on synthetic datasets show that the proposed internal index effectively finds the true number of clusters in a dataset, and that the proposed external indexes based on the contingency table and sample pairs effectively evaluate clustering results in the presence of class skew and noisy data.
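The abstract names the ingredients of the proposed indexes (sample pairs with positive and negative information, and a variance-based ratio of separation to compactness) but gives no formulas. The following Python sketch illustrates those generic ingredients only. It is not the paper's definition of S2, PS2, or STDI, and the function names `pair_confusion`, `pair_scores`, and `variance_ratio` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def pair_confusion(truth, pred):
    """Count sample pairs as (TP, FP, FN, TN).

    A pair is "positive" when both samples share a true class, and it is
    "predicted positive" when both samples share a cluster.
    """
    if len(truth) != len(pred):
        raise ValueError("label lists must have equal length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_scores(truth, pred):
    """Sensitivity and specificity over pairs, so that both positive and
    negative pairs contribute (an assumption of this sketch)."""
    tp, fp, fn, tn = pair_confusion(truth, pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def variance_ratio(points, labels):
    """Between-cluster variance of the centroids divided by the mean
    within-cluster variance. A generic ratio, not the paper's STDI."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    centroids = {label: mean(rows) for label, rows in clusters.items()}
    between = sum(sq_dist(c, overall) for c in centroids.values()) / len(centroids)
    within = sum(
        sum(sq_dist(p, centroids[label]) for p in rows) / len(rows)
        for label, rows in clusters.items()
    ) / len(clusters)
    return between / within if within else float("inf")
```

Under these assumptions, a larger `variance_ratio` indicates tighter and better separated clusters, so one could compute it for several candidate cluster counts and compare the values. The pair counts make the role of negative information visible: `tn` grows quickly on skewed data, which is why an index that ignores it can behave differently from one that uses it.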



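Memo

Received: 2017-06-08; Revised:
Funding: National Natural Science Foundation of China (61673251); Shaanxi Province Key Science and Technology Program (2013K12-03-24); Shaanxi Normal University Graduate Innovation Fund (2015CXS028, 2016CSY009); Key Project of the Fundamental Research Funds for the Central Universities (GK201701006).
About the authors: XIE Juanying, female, born in 1971, associate professor, Ph.D. Her main research interests are machine learning, data mining, and biomedical big data analysis. She is an associate editor of the international journal HISS. She has published more than 60 academic papers, with a single paper cited by others more than one hundred times on Google Scholar and more than 40 times in the SCI source journal database, and has published 2 monographs. ZHOU Ying, female, born in 1992, master's student; her main research interest is data mining. WANG Mingzhao, male, born in 1990, master's student; his main research interest is data mining.
Corresponding author: XIE Juanying. E-mail: xiejuany@snnu.edu.cn
Last Update: 2018-01-03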