[1] XIE Juanying, ZHOU Ying, WANG Mingzhao, et al. New criteria for evaluating the validity of clustering[J]. CAAI Transactions on Intelligent Systems, 2017, 12(6): 873-882. [doi:10.11992/tis.201706029]
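CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
12
Issue:
No. 6, 2017
Pages:
873-882
Section:
Research Papers: Foundations of Artificial Intelligence
Publication date:
2017-12-25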
- Title:
-
New criteria for evaluating the validity of clustering
- Authors (Chinese names):
-
谢娟英, 周颖, 王明钊, 姜炜亮
-
陕西师范大学 计算机科学学院, 陕西 西安 710062
- Author(s):
-
XIE Juanying, ZHOU Ying, WANG Mingzhao, JIANG Weiliang
-
School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
-
- Keywords:
-
clustering; clustering validity; evaluation index; external criteria; internal criteria; F-measure; Adjusted Rand Index; STDI; S2; PS2
- CLC number:
-
TP108
- DOI:
-
10.11992/tis.201706029
- Abstract:
-
Criteria for evaluating clustering validity fall into two classes: external and internal. Existing external evaluation indexes do not account for skewed cluster distributions in a clustering result, and existing internal evaluation indexes often fail to identify the optimal number of clusters. To address these defects, we propose two external evaluation indexes that consider both positive and negative information, based respectively on the contingency table and on sample pairs, for evaluating clustering results on datasets with arbitrary distributions. We also propose an internal evaluation index that uses variance to measure both within-cluster compactness and between-cluster separability, and takes the ratio of separability to compactness as the validity measure. Experiments on datasets from the UCI (University of California, Irvine) machine learning repository and on synthetic datasets show that the proposed internal index effectively finds the true number of clusters in a dataset, and that the proposed external indexes based on the contingency table and sample pairs effectively evaluate clustering results in the presence of class skew and noisy data.
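The abstract names two ingredients without giving formulas: counting sample pairs so that both agreeing ("positive") and disagreeing ("negative") pairs are visible, and a ratio of between-cluster separability to within-cluster compactness measured by variance. The Python sketch below illustrates those two general ideas only. It is not the paper's STDI, S2, or PS2 index; the function names `pair_counts` and `variance_ratio` and the exact variance definitions are assumptions chosen for illustration.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count sample pairs as (ss, sd, ds, dd).

    ss: same class, same cluster      sd: same class, different cluster
    ds: different class, same cluster dd: different class, different cluster
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            ss += 1
        elif same_class:
            sd += 1
        elif same_cluster:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def _mean(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_ratio(points, labels):
    """Illustrative separability/compactness ratio (assumed definition).

    Compactness: mean squared distance to the cluster centroid, averaged
    over clusters. Separability: mean squared distance of the centroids
    to the grand mean. Larger values suggest a better partition.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: _mean(members) for lab, members in clusters.items()}
    grand_mean = _mean(points)
    within = sum(
        sum(_sq_dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    ) / len(clusters)
    between = sum(
        _sq_dist(c, grand_mean) for c in centroids.values()
    ) / len(clusters)
    return float("inf") if within == 0 else between / within
```

With four points `[(0, 0), (0, 1), (10, 0), (10, 1)]`, the partition `[0, 0, 1, 1]` scores far higher under `variance_ratio` than `[0, 1, 0, 1]`. Comparing such a score across candidate cluster numbers is the usual way an internal index of this kind is used to pick the number of clusters.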
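Memo
Received: 2017-06-08; Revised: (not stated).
Funding: National Natural Science Foundation of China (61673251); Key Science and Technology Program of Shaanxi Province (2013K12-03-24); Graduate Innovation Fund of Shaanxi Normal University (2015CXS028, 2016CSY009); Key Project of the Fundamental Research Funds for the Central Universities (GK201701006).
About the authors: XIE Juanying, female, born in 1971, associate professor, Ph.D. Her main research interests are machine learning, data mining, and biomedical big data analysis. She is an associate editor of the international journal HISS and has published more than 60 academic papers, with one paper cited by others more than 100 times on Google Scholar and one paper cited by others more than 40 times in the SCI source journal database. She has published 2 monographs. ZHOU Ying, female, born in 1992, master's student; her main research interest is data mining. WANG Mingzhao, male, born in 1990, master's student; his main research interest is data mining.
Corresponding author: XIE Juanying. E-mail: xiejuany@snnu.edu.cn.
Last Update: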
2018-01-03