Elsevier

Information Sciences

Volume 506, January 2020, Pages 346-365

A new internal index based on density core for clustering validation

https://doi.org/10.1016/j.ins.2019.08.029

Abstract

Clustering validation, which evaluates the quality of clustering results, is recognized as one of the vital issues in clustering applications. A density core, a set of connected maximum local density peaks, can approximately represent the structure of a cluster without being affected by noise. We therefore propose a density-core-based clustering validation index (DCVI) built on the minimum spanning tree (MST), addressing the problem that noise and arbitrarily shaped clusters strongly degrade the performance of widely used internal validation measures. For data sets containing arbitrarily shaped clusters with noise, DCVI can effectively obtain the optimal number of clusters. Moreover, experimental results on both synthetic and real data sets indicate that DCVI outperforms most existing measures.

Introduction

The aim of clustering is to divide objects into clusters according to a similarity measure, so that objects within the same cluster are highly similar while objects in different clusters are highly dissimilar [24], [36]. As one of the essential methods used in learning systems, clustering has been studied extensively in the statistics, machine learning and database communities [18], [29], [30], [31], [32], [42]. Clustering methods are generally classified into density-based clustering, partitioning clustering and hierarchical clustering. Density-based clustering assumes that clusters are high-density regions of the input distribution surrounded by low-density areas [26]; DBSCAN is an algorithm that follows this line of research [11]. K-means [17], one of the best known partitioning algorithms, partitions a data set into k clusters by minimizing the sum of squared distances between objects and their cluster centroids. Hierarchical clustering algorithms, including agglomerative hierarchical clustering (AHC) and divisive hierarchical clustering (DHC) [44], build a cluster hierarchy and obtain various clusterings from the hierarchy information.
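To make the partitioning idea concrete, the following is a minimal sketch of Lloyd's algorithm for K-means in plain Python. The toy data set and the deterministic initialization from the first k points are our own illustrative choices; practical implementations use random or k-means++ initialization and a convergence test rather than a fixed iteration count.

```python
from math import dist

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternately assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    # Deterministic initialization from the first k points (sketch only).
    centroids = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        # Recompute centroids; keep the old one if a cluster goes empty.
        centroids = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
    return centroids, labels
```

Note that K-means still requires the number of clusters k as input, which is exactly the parameter that validation indexes such as the one proposed in this paper aim to determine.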

Most basic clustering algorithms assume that the number of clusters is known in advance, yet this user-defined parameter is difficult to set in practice [13]. As an unsupervised learning task, clustering needs to validate the quality of its partition results; otherwise it would be difficult to choose among different clustering results [25]. Clustering validation indexes, which play an important role in cluster analysis, are therefore used to evaluate and verify clustering results. According to whether external information is needed, clustering validation indexes fall into two main categories: external and internal. External validation indexes compare a clustering result with a pre-determined partition, usually referred to as the golden or ground-truth partition [27], while internal validation indexes are applied to select the best clustering algorithm as well as the optimal number of clusters without any additional information. Because external information is usually unavailable in practice, internal validation indexes are more widely applicable. A variety of validation indexes are available for determining the number of clusters in a data set; the Davies-Bouldin index (DBI) [10], the Silhouette index (Sil) [35], the Calinski-Harabasz index (CH) [3] and the Gap index (Gap) [38] are among the most recommended. However, these measures are sensitive to noise and perform well only on sphere-shaped clusters. Recently, internal validation indexes have been proposed to handle data sets with arbitrarily shaped clusters: a clustering validation index based on the nearest neighbor (CVNN) is proposed by Liu et al. [25], and a clustering validation index based on the agglomerative hierarchical algorithm (CSP) is presented by Zhou et al. [48].
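As an illustration of how an internal index works, here is a minimal sketch of the Silhouette index mentioned above, computed directly from its definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean intra-cluster distance of point i and b(i) is the smallest mean distance from i to any other cluster. The function name and toy data are our own; library implementations (e.g. in scikit-learn) are far more efficient.

```python
from math import dist

def silhouette_index(points, labels):
    """Mean silhouette score over all points."""
    members = {}
    for i, l in enumerate(labels):
        members.setdefault(l, []).append(i)
    scores = []
    for i, l in enumerate(labels):
        same = [j for j in members[l] if j != i]
        if not same:                 # singleton cluster: s(i) defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to points of the same cluster
        a = sum(dist(points[i], points[j]) for j in same) / len(same)
        # b: smallest mean distance to any other cluster
        b = min(sum(dist(points[i], points[j]) for j in members[m]) / len(members[m])
                for m in members if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

A partition is scored for each candidate number of clusters, and the candidate maximizing the index is selected; the sensitivity to noise and cluster shape discussed above stems from the index relying on raw pairwise distances over all objects.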
However, because they are not robust to noise, CVNN and CSP are only effective for well-separated clusters. A clustering validation index based on local cores (LCVV) [4] has been proposed to improve the Sil index, but its results are still affected by noise.

To solve the above problems, we propose a new density-core-based clustering validation index (DCVI) that is robust to noise and effectively evaluates clustering results of arbitrary shapes. Because density cores can approximately represent the structure of clusters without being affected by noise, DCVI employs them as representatives to evaluate within-cluster compactness and between-cluster separation. Since DCVI is based on density cores and the minimum spanning tree (MST), we combine MST and DCVI into the MST-DCVI clustering algorithm to find the optimal number of clusters; this avoids constructing minimum spanning trees repeatedly when applying DCVI and improves the efficiency of the clustering algorithm. Experimental results show that MST-DCVI outperforms most existing measures on both synthetic and real data sets.

The remainder of this paper is organized as follows. Section 2 presents a brief overview of internal clustering validation measures and MST-based clustering algorithms. Section 3 introduces the concept of the density core. Section 4 describes a novel MST clustering algorithm based on the new validation index. Section 5 presents experimental results on synthetic and real data sets, and Section 6 discusses DCVI. Finally, Section 7 summarizes this paper and outlines future work.

Section snippets

Related works

In this section, we review related work on internal clustering validation indexes and MST-based clustering algorithms.

The Natural Neighbor Stable Structure

The Natural Neighbor Stable Structure [49], a new concept derived from objective reality, is inspired by interpersonal relationships in human society: the number of one's true friends should be the number of people who regard him/her as a friend and whom he/she regards as a friend at the same time. If everyone has at least one friend, a relatively stable structure forms in the society, called the Natural Neighbor Stable Structure (NSS). The key idea of the Natural Neighbor Stable
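A minimal sketch of the underlying search, not the authors' exact NSS algorithm, grows the neighborhood size k until every point appears in at least one other point's k-nearest-neighbor list, i.e. until everyone has at least one "friend". The function name, the brute-force neighbor search and the toy data are our own illustrative choices; [49] uses a KD-tree and a more refined stopping criterion.

```python
from math import dist

def natural_neighbor_eigenvalue(points):
    """Grow k until every point is contained in at least one other
    point's k-nearest-neighbor list (everyone has a 'friend')."""
    n = len(points)
    # For each point, the other points sorted by distance (brute force).
    order = [sorted((j for j in range(n) if j != i),
                    key=lambda j: dist(points[i], points[j]))
             for i in range(n)]
    k = 1
    while True:
        in_degree = [0] * n
        for i in range(n):
            for j in order[i][:k]:
                in_degree[j] += 1
        # Stop when no point is friendless; the n-1 cutoff guards
        # against degenerate cases such as isolated outliers.
        if all(d > 0 for d in in_degree) or k >= n - 1:
            return k
        k += 1
```

The returned k adapts to the data: uniform data stabilizes at a small k, while an outlier that nobody counts among their nearest neighbors forces k to grow.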

The density-core-based clustering validation index

We assume TD=(VTD,ETD) is the minimum spanning tree of the graph GD=(VD,ED) of data set D, where |V| is the number of vertices and |E| is the number of edges. W(e) is the weight of an edge, and W′(c) is the c-th largest edge weight in the minimum spanning tree of GD. According to the theorem in the CSP index [48], for any 1 ≤ c ≤ |E|, we remove the edges whose weights are greater than W′(c) from TD, and c connected branches marked as Tk=(Vk,Ek) (k=1, 2, …, c) are
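The edge-removal step described above can be sketched as follows. This illustrates only the generic MST cut over all objects (build the MST once, then drop the c−1 heaviest edges to obtain c components), not DCVI's density-core construction; the function name, Kruskal/union-find implementation and toy data are our own.

```python
from math import dist
from itertools import combinations

def mst_cut_clusters(points, c):
    """Kruskal MST over the complete Euclidean graph, then remove the
    c-1 heaviest MST edges; the c connected components are the clusters."""
    n = len(points)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    mst = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((dist(points[u], points[v]), u, v))
            if len(mst) == n - 1:
                break
    # Keep the n-c lightest MST edges (i.e. remove the c-1 heaviest).
    keep = sorted(mst)[:n - c]
    parent = list(range(n))
    for _, u, v in keep:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]       # component label per point
```

Because the MST is built once and only the cut changes with c, scoring every candidate number of clusters is cheap, which is the efficiency argument behind MST-DCVI.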

Experimental evaluations

In this section, 16 synthetic data sets and 6 real data sets are used to evaluate the validation properties and performance of DCVI. We compare MST-DCVI not only with the AHC algorithm combined with other indexes (DBI, Sil, CH, Gap, CSP, CVNN and LCVV), but also with evolutionary clustering algorithms that obtain the optimal number of clusters, such as the genetic-algorithm-based clustering algorithm (GA) [7], the particle swarm optimization-based clustering algorithm (PSO) [45] and the Differential

Discussion

In this paper, density cores are used to compute DCVI in order to minimize the adverse effects of noise. Moreover, we perform experiments on synthetic data sets to demonstrate that density cores produce better clustering results than using all objects does.

As shown in Table 13, density cores produce the correct optimal number of clusters for all data sets, while using all objects obtains the correct number only for DS11 and DS12. Due to the adverse effects of noise, the use of all

Conclusion

Unlike existing methods, which are mainly designed to use all objects in a data set, this paper proposes a clustering validation method based on the density core. Density cores eliminate the adverse effects of noise, so DCVI is effective for obtaining the optimal number of clusters in data sets with noise. Building on an MST-based clustering algorithm and DCVI, the MST-DCVI clustering algorithm is proposed to evaluate clusters of arbitrary shapes. Experimental results of synthetic

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Acknowledgments

The authors would like to thank the Associate Editor and anonymous reviewers for their valuable comments and suggestions. We are also grateful to Ming-Ming Zhang for his advice on our manuscript. This work is partly funded by the National Natural Science Foundation of China (no. 61762025), the project of frontier and application foundation research program of CQ CSTC (no. cstc2017jcyjAX0340) and the CERNET Next Generation Internet Technology Innovation Project (NGII 20150706).

References (49)

  • G.X. Ritter et al.

    A simple statistics-based nearest neighbor cluster detection algorithm

    Pattern Recognit.

    (2015)
  • P. Rousseeuw

    Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

    J. Comput. Appl. Math.

    (1987)
  • Y. Zhang et al.

    Curvature-based method for determining the number of clusters

    Inf. Sci.

    (2017)
  • C. Zhong et al.

    Minimum spanning tree based split-and-merge: a hierarchical clustering method

    Inf. Sci.

    (2011)
  • Q. Zhu et al.

    Natural neighbor: a self-adaptive neighborhood method without parameter k

    Pattern Recognit. Lett.

    (2016)
  • J.L. Bentley

    Multidimensional binary search trees used for associative searching

    Commun. ACM

    (1975)
  • J.C. Bezdek et al.

    Some new indexes of cluster validity

    IEEE Trans. Syst. Man Cybern. Part B Cybern.

    (1998)
  • T. Caliński et al.

    A dendrite method for cluster analysis

    Commun. Stat.-Theory Methods

    (1974)
  • D. Cheng et al.

    A novel cluster validity index based on local cores

    IEEE Trans. Neural Netw. Learn. Syst.

    (2018)
  • Y. Cheng

    Mean shift, mode seeking, and clustering

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1995)
  • D. Dajun et al.

    An improved genetic algorithm for spatial clustering

    Proceedings – 18th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2006

    (2006)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1979)
  • M. Ester et al.

    Density-based clustering in spatial databases: the algorithm gdbscan and its applications

    Data Min. Knowl. Discov.

    (1998)
  • T. Fawcett

    An introduction to ROC analysis

    Pattern Recognit. Lett.

    (2005)