Rapid
advances in digital data acquisition and storage technology
have resulted in the growth of huge databases (or datasets)
containing millions of objects, each recording values for
hundreds of attributes that may be numeric or categorical.
Data mining or knowledge discovery in databases is the science
of extracting useful information from such huge databases
(Han and Kamber, 2006). It aims at the construction of automatic
or semiautomatic tools for the analysis of such datasets.
Clustering is one of the most important tasks in data mining
(Han and Kamber, 2006). The main goal of clustering is to
group data objects into clusters in such a way that objects
belonging to the same cluster are similar, while those belonging
to different ones are dissimilar.
By clustering one can identify
dense and sparse regions and, therefore, discover overall
distribution patterns and interesting correlations among the
attributes. A
survey of clustering algorithms can be found in Andritsos
(2002). Many algorithms have been developed for clustering
numeric data based on the use of similarity measures that
exploit inherent geometrical structures of numeric data. A
limited number of studies have focused on clustering categorical
data, where the domains of the individual attributes are discrete
valued and not naturally ordered, and therefore distance functions
are not naturally defined. Moreover, categorical datasets
are generally high dimensional.
Most of the common clustering
algorithms fail to perform efficiently and accurately for
high dimensional data, because such datasets do not exhibit
clusters over the full set of dimensions. Many of the dimensions
are often irrelevant or correlated, and different clusters
may have different subsets of relevant dimensions. Subspace
clustering algorithms find a subset of relevant dimensions
for each cluster (Parsons et al., 2004). Subspaces
of different clusters are almost always allowed to be overlapping.
Some algorithms allow the clusters to be overlapping, while
others find a set of disjoint clusters that cover the entire
dataset. Some algorithms also detect outliers, which are the
objects that do not belong to any of the clusters. |