The IUP Journal of Genetics & Evolution
An Empirical Investigation on Classical Clustering Methods

Five classical clustering methods, four hierarchical (single linkage, average-between linkage, average-within linkage and Ward's) and one non-hierarchical (k-means), were compared using five distance measures (squared Euclidean, city block, Chebychev, Pearson correlation and Minkowski) on simulated multivariate data on paddy crop genotypes. The methods were compared on the basis of the average percentage probability of misclassification and its standard error.
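For readers unfamiliar with the five distance measures named above, the following is a minimal sketch of their usual textbook definitions for two observation vectors. The function names, the default Minkowski order `p=3`, and the conversion of the Pearson correlation into a dissimilarity as `1 - r` are assumptions made for illustration; the study's own software settings are not stated here.

```python
import math

def squared_euclidean(x, y):
    # Sum of squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    # Sum of absolute coordinate differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p=3):
    # p-th root of the sum of p-th powers of absolute differences.
    # The order p=3 is an illustrative assumption.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def pearson_dissimilarity(x, y):
    # Assumption: correlation r is turned into a dissimilarity as 1 - r.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two hypothetical genotypes measured on three traits.
g1 = [1.0, 2.0, 3.0]
g2 = [4.0, 6.0, 9.0]
print(squared_euclidean(g1, g2), city_block(g1, g2), chebychev(g1, g2))
```

Note that the Minkowski distance reduces to the city block distance at `p=1` and to the ordinary Euclidean distance at `p=2`.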

 
 
 

The performance of the hierarchical clustering methods varied with the distance measure used. In the majority of cases, the squared Euclidean distance gave the lowest average percentage probability of misclassification, followed by the city block distance. Among the five methods, Ward's method performed best, with the lowest average percentage probability of misclassification, followed by the non-hierarchical k-means method, irrespective of sample size.
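The comparison criterion can be illustrated with a small sketch. Because the data are simulated, the true group of each genotype is known, so a clustering can be scored by the percentage of genotypes placed in the wrong group, and the scores can then be averaged over simulation runs. The exact procedure used in the study is not given in this summary; the label-matching rule (cluster labels are arbitrary, so the most favourable one-to-one matching is taken) and all names below are assumptions.

```python
from itertools import permutations
from math import sqrt

def misclassification_pct(true_labels, cluster_labels):
    # Percentage of items whose cluster does not match their true group,
    # under the most favourable one-to-one matching of clusters to groups.
    groups = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(groups):
        raise ValueError("sketch assumes no more clusters than true groups")
    best_hits = 0
    for assigned in permutations(groups, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_labels))
        best_hits = max(best_hits, hits)
    n = len(true_labels)
    return 100.0 * (n - best_hits) / n

def mean_and_se(values):
    # Mean and standard error of the mean (sample variance, n - 1 divisor).
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, sqrt(var / n)

# Hypothetical run: six genotypes, two true groups, one genotype misplaced.
true_groups = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
print(misclassification_pct(true_groups, found))

# Hypothetical percentages from three simulation runs.
print(mean_and_se([10.0, 20.0, 30.0]))
```

Trying every permutation is only practical for a handful of groups, which is adequate for a sketch of this kind.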

The summarization of large quantities of multivariate data is increasingly practiced in various branches of agricultural science. A number of multivariate statistical techniques, namely cluster analysis, principal components analysis and factor analysis, are widely used for classification purposes. One of the basic problems faced by plant breeders is to classify a large number of genotypes/lines into fewer, manageable, homogeneous groups/clusters. A large number of clustering methods and dissimilarity measures are available in the literature for forming homogeneous groups, and a main problem for the breeder is to choose a suitable combination among them. There is hardly any information available in the literature on the performance of these clustering methods and dissimilarity measures.

Among the existing clustering methods, researchers most commonly use UPGMA (Unweighted Pair-Group Method using Arithmetic Averages) and Ward's method, followed by SLINK (Single Linkage) and CLINK (Complete Linkage). According to Blashfield (1976), UPGMA, Ward's and SLINK account for three-quarters of the published work that used cluster analysis. The less frequently used clustering methods, which appear occasionally in applications, are the WPGMA (Weighted Pair-Group Method using Arithmetic Averages) method, the centroid method and the flexible method (Sneath and Sokal, 1973). Lin (1982), Ramey and Rosielle (1983) and Wahi and Kher (1991) promoted the application of clustering techniques to group genotypes or environments, but the number of clusters obtained from these methods is not unique, because the groups produced by different clustering procedures are not consistently representative.

The k-means method requires prior knowledge of the number of clusters, but in unsupervised classification there is usually no prior idea about the number of clusters. In contrast, hierarchical clustering methods do not require prior knowledge of the number of clusters, which is a definite advantage over the k-means method.
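The advantage of hierarchical methods noted above can be seen in a short sketch of agglomerative single linkage clustering: the procedure builds the full sequence of merges without being told how many clusters to form, and any number of clusters can be read off afterwards. This is a naive illustration under assumed names and toy data, not the implementation used in the study.

```python
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def single_linkage_history(points, dist=squared_euclidean):
    # Start with every point in its own cluster and repeatedly merge the
    # two clusters whose closest members are nearest to each other.
    clusters = [[i] for i in range(len(points))]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append([sorted(c) for c in clusters])
    return history

# Five hypothetical genotypes measured on a single trait.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
history = single_linkage_history(points)

# history[n - k] holds the partition into k clusters, for any k chosen later.
n = len(points)
print(sorted(history[n - 3]))
print(sorted(history[n - 2]))
```

A k-means run, by comparison, would need the value of k before any grouping could begin.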

 
 
 

Keywords: Genetics & Evolution Journal, Empirical Investigation, Classical Clustering Methods, Cluster Analysis, Principal Components Analysis, Squared Euclidean Distance, Hierarchical Clustering Methods, Pearson Correlation, Clustering Methods, Homogeneous Clusters, Agricultural Sciences.