A new internal index based on density core for clustering validation
Introduction
The aim of clustering is to divide objects into clusters according to a similarity measure, so that objects within the same cluster have a high degree of similarity while objects belonging to different clusters have a high degree of dissimilarity [24], [36]. As one of the essential methods used in learning systems, clustering has been studied extensively in the statistics, machine learning and database communities [18], [29], [30], [31], [32], [42]. Clustering methods are generally classified into density-based clustering, partition clustering and hierarchical clustering. Density-based clustering assumes that clusters are high-density regions of the input distribution surrounded by low-density areas [26], and DBSCAN is an algorithm which follows this line of research [11]. As to K-means [17], one of the best-known partition clustering algorithms, its basic idea is to partition a data set into k clusters by minimizing the distances between objects and their cluster centroids. Hierarchical clustering algorithms, including agglomerative hierarchical clustering (AHC) and divisive hierarchical clustering (DHC) [44], build a cluster hierarchy and obtain various clusters according to the hierarchy information.
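The difference between partition and density-based clustering can be illustrated with a small sketch. The snippet below uses scikit-learn, which is not part of this paper's method; the data set and parameter values (eps, min_samples) are illustrative assumptions only:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters: a case where density-based clustering
# succeeds and centroid-based partitioning tends to struggle.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means partitions the data into k clusters around centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters from high-density regions; points in
# low-density areas are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
```

Here K-means is forced to separate the crescents with a near-linear boundary, while DBSCAN can follow their arbitrary shape, which motivates validation indexes that are not biased toward sphere-shaped clusters.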
Most basic clustering algorithms assume that the number of clusters is given in advance, yet this user-defined parameter is difficult to set in practice [13]. As an unsupervised learning task, clustering needs to validate the quality of its partition results, otherwise it would be difficult to make use of different clustering results [25]. Therefore, clustering validation indexes, which play an important role in clustering analyses, are used to evaluate and verify clustering results. According to whether external information is needed, clustering validation indexes can be classified into two main categories: external clustering validation indexes and internal clustering validation indexes. External validation indexes aim at comparing a clustering result with a pre-determined partition which is usually referred to as the golden or ground-truth partition [27], while internal validation indexes are usually applied to selecting the best clustering algorithm as well as the optimal number of clusters without any additional information. External information is usually unavailable in common applications, so internal validation indexes are more applicable for clustering validation. A variety of clustering validation indexes are available to determine the number of clusters in a data set. The Davies–Bouldin index (DBI) [10], the Silhouette index (Sil) [35], the Calinski–Harabasz index (CH) [3] and the Gap index (Gap) [38] are several recommended clustering validation measures. However, these measures are usually affected by noise and merely perform well on sphere-shaped clusters. Recently, internal clustering validation indexes have been proposed to handle data sets with arbitrarily shaped clusters. A clustering validation index based on the nearest neighbor (CVNN) is proposed by Liu et al. [25], and a clustering validation index based on the agglomerative hierarchical algorithm (CSP) is presented by Zhou [48].
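Selecting the number of clusters with internal indexes such as Sil and DBI can be sketched as follows. This is a minimal example using scikit-learn's implementations of these indexes, not the paper's code; the data set and the range of candidate k are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data with an (unknown to the index) number of blobs.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=1)

sil, dbi = {}, {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    sil[k] = silhouette_score(X, labels)       # Sil: higher is better
    dbi[k] = davies_bouldin_score(X, labels)   # DBI: lower is better

best_k_sil = max(sil, key=sil.get)
best_k_dbi = min(dbi, key=dbi.get)
```

Each candidate partition is scored and the k with the best index value is chosen; on noisy or non-spherical data the two indexes may disagree, which is precisely the weakness the paper addresses.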
However, CVNN and CSP are not robust to noise and are only effective for well-separated clusters. A clustering validation index based on local core (LCVV) [4] is proposed to improve the Sil index, but the results of LCVV are also affected by noise.
To solve the above problems, we propose a new density-core-based clustering validation index (DCVI) which is robust to noise and effectively evaluates clustering results of arbitrary shapes. Density cores can approximately represent the structure of clusters without being affected by noise, so DCVI employs density cores as representatives to evaluate within-cluster compactness and between-cluster separation. Since DCVI is based on density cores and the minimum spanning tree (MST), we combine MST and DCVI into the MST-DCVI clustering algorithm to find the optimal number of clusters, which avoids repeatedly constructing minimum spanning trees when applying DCVI and improves the efficiency of the clustering algorithm. Experimental results show that MST-DCVI outperforms most existing measures on both synthetic and real data sets.
The remainder of this paper is organized as follows. Section 2 presents a brief overview of internal clustering validation measures and MST-based clustering algorithms. Section 3 introduces the concept of density core. Section 4 describes a novel MST clustering algorithm which is based on the new validation index. Section 5 presents experimental results on synthetic and real data sets, and Section 6 discusses DCVI. Finally, Section 7 summarizes this paper and outlines future work.
Related works
We review related works on internal clustering validation indexes and the MST clustering algorithm in this section.
The Natural Neighbor Stable Structure
The Natural Neighbor Stable Structure [49], a new concept derived from objective reality, is inspired by interpersonal relationships in human society: one's true friends are the people who regard him/her as a friend and whom he/she regards as a friend at the same time. If everyone has at least one friend, the society forms a relatively stable structure called the Natural Neighbor Stable Structure (NSS). The key idea of the Natural Neighbor Stable
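The stable-state search implied by this definition can be sketched as follows: the neighborhood size r grows until every object is taken as a neighbor by at least one other object. This is a plain-NumPy illustration assuming Euclidean k-NN; the function name is hypothetical and the code is not from [49]:

```python
import numpy as np

def natural_neighbor_round(X):
    """Grow the neighborhood size r until every point is nominated as a
    neighbor by at least one other point (the 'everyone has a friend'
    stable state). A sketch only: plain Euclidean distances, no
    early-termination refinements."""
    n = len(X)
    # Pairwise distances; argsort ranks each point's neighbors.
    # Column 0 is the point itself, so neighbor ranks start at column 1.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    reverse_count = np.zeros(n, dtype=int)
    for r in range(1, n):
        # Each point nominates its r-th nearest neighbor as a "friend".
        np.add.at(reverse_count, order[:, r], 1)
        if np.all(reverse_count > 0):  # stable structure reached
            return r
    return n - 1

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
r_stable = natural_neighbor_round(X)
```

The value of r at which the structure stabilizes adapts to the data distribution, which is what makes the natural-neighbor idea parameter-free.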
The density-core-based clustering validation index
We assume T_D is the minimum spanning tree of the graph G_D of data set D. |V| is the number of vertices and |E| is the number of edges. W(e) is the weight of an edge e, and W′(c) is the c-th largest weight in the minimum spanning tree of the graph G_D. According to the theorem in the CSP index [48], for any 1 ≤ c ≤ |E|, we remove the edges whose weights are greater than W′(c) from T_D and obtain c connected branches which are
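The edge-removal step above can be sketched with SciPy: cutting the c−1 heaviest MST edges (those heavier than W′(c)) leaves c connected branches. The helper below is a hypothetical illustration, not the paper's implementation, and the two-blob data set is an assumption for demonstration:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_branches(X, c):
    """Cut the (c - 1) heaviest edges of the Euclidean MST of X,
    returning labels of the resulting c connected branches
    (assumes the c-1 largest edge weights are distinct)."""
    d = squareform(pdist(X))
    mst = minimum_spanning_tree(d).toarray()  # n-1 edges, upper triangular
    if c > 1:
        weights = mst[mst > 0]
        threshold = np.sort(weights)[-(c - 1)]  # = W'(c)
        mst[mst >= threshold] = 0               # remove the heaviest edges
    n_comp, labels = connected_components(mst, directed=False)
    return labels

# Two well-separated blobs: cutting one edge yields two branches.
X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (50, 2)),
               np.random.default_rng(1).normal(5, 0.3, (50, 2))])
labels = mst_branches(X, 2)
```

Because the MST is built once and only edge removals vary with c, all candidate partitions can be scored cheaply, which is the efficiency argument behind MST-DCVI.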
Experimental evaluations
In this section, 16 synthetic data sets and 6 real data sets are used to evaluate the validation properties and performance of DCVI. We compare MST-DCVI not only with the AHC algorithm combined with other indexes such as DBI, Sil, CH, Gap, CSP, CVNN and LCVV, but also with evolutionary clustering algorithms that obtain the optimal number of clusters, such as the Genetic-based clustering algorithm (GA) [7], the Particle Swarm Optimization-based clustering algorithm (PSO) [45] and the Differential
Discussion
In this paper, density cores are used to compute DCVI in order to minimize the adverse effects of noise. Moreover, we perform experiments on synthetic data sets to demonstrate that density cores produce better clustering results than using all objects does.
As shown in Table 13, density cores produce the correct optimal number of clusters for all data sets, while using all objects only obtains the correct number of clusters for DS11 and DS12. Due to the adverse effects of noise, the use of all
Conclusion
Unlike existing methods, which are mainly designed to use all objects in a data set, a clustering validation method based on density cores is proposed in this paper. Density cores can eliminate the adverse effects of noise, so DCVI is effective for obtaining the optimal number of clusters in data sets with noise. Based on the MST-based clustering algorithm and DCVI, the MST-DCVI clustering algorithm is proposed to evaluate clusters of arbitrary shapes. Experimental results of synthetic
Declaration of Competing Interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
Acknowledgments
The authors would like to thank the Associate Editor and anonymous reviewers for their valuable comments and suggestions. They are also obliged to Ming-Ming Zhang for giving advice on our manuscript. This work is partly funded by the National Natural Science Foundation of China (no. 61762025), the project of frontier and application foundation research program of CQ CSTC (no. cstc2017jcyjAX0340) and CERNET Next Generation Internet Technology Innovation Project (NGII 20150706).
References (49)
- et al., Minimal spanning tree based clustering technique: relationship with Bayes classifier, Pattern Recognit. (1997)
- et al., Automatic image pixel clustering with an improved differential evolution, Appl. Soft Comput. (2009)
- et al., Looking for natural patterns in data: part 1. Density-based approach, Chemom. Intell. Lab. Syst. (2001)
- et al., ESC: an efficient synchronization-based clustering algorithm, Knowl.-Based Syst. (2013)
- et al., A non-parameter outlier detection algorithm based on natural neighbor, Knowl.-Based Syst. (2016)
- Data clustering: 50 years beyond k-means, Int. Conf. Pattern Recognit. (2010)
- et al., How many clusters? A robust PSO-based local density model, Neurocomputing (2016)
- et al., Shared-nearest-neighbor-based clustering by fast search and find of density peaks, Inf. Sci. (2018)
- et al., Unsupervised learning by cluster quality optimization, Inf. Sci. (2018)
- et al., An extended membrane system with active membranes to solve automatic fuzzy clustering problems, Int. J. Neural Syst. (2016)