Genetics & Evolution Journal | An Empirical Investigation on Classical Clustering Methods

The IUP Journal of Genetics & Evolution

An Empirical Investigation on Classical Clustering Methods

Article Details

Pub. Date	:	August, 2009
Product Name	:	The IUP Journal of Genetics & Evolution
Product Type	:	Article
Product Code	:	IJGE90908
Author Name	:	S D Wahi, Sukanta Dash and A R Rao
Availability	:	YES
Subject/Domain	:	Science & Technology
Download Format	:	PDF Format
No. of Pages	:	6

Abstract

Five classical clustering methods, four hierarchical (single linkage, average-between linkage, average-within linkage and Ward's) and one non-hierarchical (k-means), were compared using five distance measures (squared Euclidean, city block, Chebychev, Pearson correlation and Minkowski) on simulated multivariate data on paddy crop genotypes. The performance of the clustering methods was compared on the basis of the average percentage probability of misclassification and its standard error.
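For readers unfamiliar with the five distance measures named above, the following is a minimal Python sketch of their textbook definitions for two observation vectors. It is not taken from the article: the function names are illustrative, the Minkowski order `p=3` is an arbitrary assumption (the abstract does not state the order used), and converting the Pearson correlation r into a dissimilarity as 1 - r is likewise an assumption.

```python
import math


def squared_euclidean(x, y):
    # Sum of squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def city_block(x, y):
    # Sum of absolute coordinate differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p=3):
    # Order p is an assumption; p=1 gives city block, p=2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def pearson_distance(x, y):
    # Assumed conversion of correlation r to a dissimilarity: 1 - r.
    # Undefined (division by zero) if either vector is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]
print(squared_euclidean(x, y), city_block(x, y), chebychev(x, y))
```

Note that the squared Euclidean measure is not a metric in the strict sense (it violates the triangle inequality), but it is a common dissimilarity for Ward's method and k-means.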

Description

The performance of the hierarchical clustering methods varied with the distance measure used. Among the five distance measures, squared Euclidean distance showed the least average percentage probability of misclassification, followed by city block distance, in the majority of cases. Among the five methods, Ward's method performed best, with the least average percentage probability of misclassification, followed by the non-hierarchical k-means method, irrespective of the sample size.
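The comparison criterion can be illustrated with a small Python sketch. The article page does not give the exact formula, so this assumes one plausible reading: for simulated data the true group of each genotype is known, cluster labels are arbitrary, and the misclassification percentage is taken under the best one-to-one matching of clusters to true groups. The average and its standard error are then computed over simulation replicates. The function names and the sample values are hypothetical.

```python
import math
from itertools import permutations


def percent_misclassified(true_labels, cluster_labels):
    # Try every one-to-one assignment of clusters to true groups and keep
    # the one with the most agreements (feasible only for a few groups).
    groups = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(groups):
        raise ValueError("more clusters than true groups")
    n = len(true_labels)
    best_correct = 0
    for perm in permutations(groups, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t
        )
        best_correct = max(best_correct, correct)
    return 100.0 * (n - best_correct) / n


def mean_and_standard_error(values):
    # Sample mean and standard error (sample SD divided by sqrt(n)).
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical example: six genotypes, two true groups, one misplaced.
true_groups = [0, 0, 0, 1, 1, 1]
found = ["b", "b", "a", "a", "a", "a"]
rate = percent_misclassified(true_groups, found)

# Hypothetical misclassification percentages from three replicates.
avg, se = mean_and_standard_error([10.0, 20.0, 30.0])
print(round(rate, 2), avg, round(se, 2))
```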

The summarization of large quantities of multivariate data is increasingly practiced in various branches of agricultural science. Multivariate statistical techniques such as cluster analysis, principal components analysis and factor analysis are widely used for classification purposes. One of the basic problems faced by plant breeders is to classify a large number of genotypes/lines into fewer, manageable homogeneous groups/clusters. A large number of clustering methods and dissimilarity measures are available in the literature for forming homogeneous groups, so the breeder must choose a suitable combination of the two. There is hardly any information in the literature on the relative performance of these clustering methods and dissimilarity measures.

Among the existing clustering methods, researchers most commonly use UPGMA (Unweighted Pair-Group Method using Arithmetic Averages) and Ward's method, followed by SLINK (Single Linkage) and CLINK (Complete Linkage). According to Blashfield (1976), UPGMA, Ward's method and SLINK account for three-fourths of the published work that used cluster analysis. The less frequently used clustering methods, which appear occasionally in applications, are the WPGMA (Weighted Pair-Group Method using Arithmetic Averages) method, the centroid method and the flexible method (Sneath and Sokal, 1973). Lin (1982), Ramey and Rosielle (1983) and Wahi and Kher (1991) promoted the application of clustering techniques to group genotypes or environments, but the number of clusters obtained from these methods is not unique, because the groups obtained through different clustering procedures are unrepresentative.

The k-means method requires prior knowledge of the number of clusters, but in unsupervised classification there is usually no prior idea about the number of clusters. In contrast, hierarchical clustering methods do not require prior knowledge of the number of clusters, which is a definite advantage over the k-means method.
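The point about hierarchical methods not needing the number of clusters in advance can be sketched in Python. The naive single-linkage routine below builds the complete merge history without being told how many clusters to form; any number of clusters k can then be read off afterwards by replaying the first n - k merges. This is an illustrative sketch, not the authors' implementation: the function names and data points are hypothetical, squared Euclidean distance is assumed, and the cubic-time search is only suitable for small examples.

```python
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def single_linkage_merges(points):
    # Start with one cluster per point and repeatedly merge the two
    # clusters whose closest members are nearest to each other.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    squared_euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Record the merge height and the members of both merged clusters.
        merges.append((d, list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


def cut_into(n_points, merges, k):
    # Replay the first n - k merges to recover a partition into k clusters.
    clusters = [[i] for i in range(n_points)]
    for _, left, right in merges[: n_points - k]:
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left + right)
    return sorted(sorted(c) for c in clusters)


# Hypothetical two-trait observations for five genotypes.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = single_linkage_merges(points)
print(cut_into(len(points), merges, 3))
print(cut_into(len(points), merges, 2))
```

The number of clusters is chosen only at the cutting step, for instance by inspecting the merge heights, whereas k-means must be given k before it starts.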

Keywords

Genetics & Evolution Journal, Empirical Investigation, Classical Clustering Methods, Cluster Analysis, Principal Components Analysis, Squared Euclidean Distance, Hierarchical Clustering Methods, Pearson Correlation, Clustering Methods, Homogeneous Clusters, Agricultural Sciences.