Volume 5, Issue 1, March 2020, Page: 20-25
Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets
Md. Bipul Hossen, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Md. Rabiul Auwul, Department of Statistics, Guangzhou University, Guangzhou, China
Received: Dec. 29, 2019; Accepted: Jan. 10, 2020; Published: Mar. 2, 2020
DOI: 10.11648/j.bsi.20200501.14
Clustering plays a fundamental role in exploring data, making predictions, and overcoming anomalies in the data. Observations that share similar characteristics in a dataset are grouped into clusters using iterative algorithms. As the volume of real-world data grows day by day, the challenge of understanding and interpreting the resulting mass of data, which often consists of millions of measurements, is compounded by the complexity of the large number of genes in biological networks. To address this challenge, we use clustering algorithms. In this study, we compare four of the most popular clustering algorithms: K-Means, PAM, Agglomerative Hierarchical, and DIANA. These are evaluated on eight real cancer gene expression datasets (four Affymetrix and four cDNA) and a simulated dataset. The comparison is based on seven popular cluster validity indices: Average Silhouette Index, Corrected Rand Index, Variation of Information, Dunn Index, Calinski-Harabasz Index, Separation Index, and Pearson Gamma. Among these four clustering algorithms, we find that PAM performs best on the Affymetrix datasets and DIANA performs best on the cDNA datasets. This study provides a practical evaluation framework for assessing clustering results on gene expression cancer datasets.
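As an illustration of how two of the validity indices named above score a partition, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy points and both label vectors are hypothetical, Euclidean distance is assumed, and the Dunn index is computed in one common variant (smallest between-cluster distance divided by largest cluster diameter). Both functions assume at least two clusters, and the Dunn function assumes at least one cluster with two or more points.

```python
from itertools import combinations
from math import dist


def average_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance assumed)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance from p to the other members of its own cluster
        # (the sum includes dist(p, p) = 0, so dividing by len - 1 is correct)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


def dunn_index(points, labels):
    """Smallest between-cluster distance over largest within-cluster distance."""
    pairs = list(combinations(zip(points, labels), 2))
    min_between = min(dist(p, q) for (p, lp), (q, lq) in pairs if lp != lq)
    max_within = max(dist(p, q) for (p, lp), (q, lq) in pairs if lp == lq)
    return min_between / max_within


# Hypothetical toy data: two visibly separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]  # labels that match the visible groups
bad = [0, 1, 0, 1, 0, 1]   # labels that mix the groups

for name, labels in (("good", good), ("bad", bad)):
    print(name, average_silhouette(points, labels), dunn_index(points, labels))
```

For both indices a larger value indicates more compact and better separated clusters, so the labeling that matches the visible groups should score higher than the mixed one. In a comparison like the one described here, the same scoring would be applied to the partitions returned by each clustering algorithm.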
Microarray, Clustering Algorithm, Gap Statistic, Validity Indices
To cite this article
Md. Bipul Hossen, Md. Rabiul Auwul, Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets, Biomedical Statistics and Informatics. Vol. 5, No. 1, 2020, pp. 20-25. doi: 10.11648/j.bsi.20200501.14
Copyright © 2020 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
