Computer Sciences Journal | CatSub: A Technique for Clustering Categorical Data Based on Subspace

The IUP Journal of Computer Sciences

CatSub: A Technique for Clustering Categorical Data Based on Subspace

Article Details

Pub. Date	:	April, 2008
Product Name	:	The IUP Journal of Computer Sciences
Product Type	:	Article
Product Code	:	IJCS10804
Author Name	:	B Borah and D K Bhattacharyya
Availability	:	YES
Subject/Domain	:	Science and Technology
Download Format	:	PDF Format
No. of Pages	:	14

Abstract

Clustering is one of the important data mining problems. Many algorithms are available for clustering numeric data. However, only a limited number of algorithms have been proposed for clustering categorical datasets, for which distance measures are not naturally defined. Moreover, categorical datasets are generally high dimensional. In high-dimensional data, many of the dimensions are often irrelevant or correlated, and different clusters exist in different subsets of dimensions. In this work we propose a subspace clustering algorithm named CatSub to extract a set of disjoint clusters and outliers from large, high-dimensional categorical datasets. We define a similarity measure and a strategy for finding subsets of relevant dimensions along with the clusters embedded in them. The algorithm scales to large datasets because it requires only a single pass through the data objects, which need not be stored in main memory. The algorithm is experimentally validated on several real-world datasets.

Description

Rapid advances in digital data acquisition and storage technology have led to huge databases (or datasets) containing millions of objects, each recording values for hundreds of attributes that may be numeric or categorical. Data mining, or knowledge discovery in databases, is the science of extracting useful information from such huge databases (Han and Kamber, 2006). It aims at the construction of automatic or semiautomatic tools for the analysis of such datasets. Clustering is one of the most important tasks in data mining (Han and Kamber, 2006). The main goal of clustering is to group data objects into clusters in such a way that objects belonging to the same cluster are similar, while those belonging to different clusters are dissimilar.

By clustering one can identify dense and sparse regions and, therefore, discover overall distribution patterns and interesting correlations among the attributes. A survey of clustering algorithms can be found in Andritsos (2002). Many algorithms have been developed for clustering numeric data based on the use of similarity measures that exploit inherent geometrical structures of numeric data. A limited number of studies have focused on clustering categorical data, where the domains of the individual attributes are discrete valued and not naturally ordered, and therefore distance functions are not naturally defined. Moreover, categorical datasets are generally high dimensional.
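Because categorical values have no natural ordering, one common workaround is to score two records by how many attributes they agree on. The sketch below illustrates this simple matching (overlap) measure. It is a generic illustration only and is not the similarity measure defined for CatSub, which this page does not specify; the function name `overlap_similarity` and the sample records are hypothetical.

```python
from typing import Sequence


def overlap_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records agree.

    Generic simple-matching measure (an assumption for illustration),
    not the measure proposed in the CatSub paper.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        return 0.0
    # Count positions where both records hold the same category.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical records with four categorical attributes each.
r1 = ("red", "round", "smooth", "small")
r2 = ("red", "oval", "smooth", "large")
sim = overlap_similarity(r1, r2)  # the records agree on 2 of 4 attributes
```

A measure of this kind treats every attribute as equally relevant, which is exactly the weakness that becomes a problem in high-dimensional data.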

Most common clustering algorithms fail to perform efficiently and accurately on high-dimensional data, because such datasets do not exhibit clusters over the full set of dimensions. Many of the dimensions are often irrelevant or correlated, and different clusters may have different subsets of relevant dimensions. Subspace clustering algorithms find a subset of relevant dimensions for each cluster (Parsons et al., 2004). Subspaces of different clusters are almost always allowed to overlap. Some algorithms allow the clusters themselves to overlap, while others find a set of disjoint clusters that cover the entire dataset. Some algorithms also detect outliers, which are objects that do not belong to any cluster.
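To make the idea of per-cluster relevant dimensions concrete, the sketch below marks an attribute as relevant for a cluster when a single category dominates it. This is one plausible heuristic, assumed here for illustration, and is not the strategy used by CatSub; the function name `relevant_dimensions`, the `threshold` parameter and the toy cluster are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def relevant_dimensions(
    cluster: Sequence[Sequence[str]], threshold: float = 0.8
) -> List[Tuple[int, str]]:
    """Return (dimension index, dominant value) pairs for a cluster.

    A dimension counts as relevant when its most frequent category
    covers at least `threshold` of the cluster's objects. This is an
    illustrative heuristic, not the CatSub criterion.
    """
    if not cluster:
        return []
    n = len(cluster)
    result = []
    for dim in range(len(cluster[0])):
        # Most frequent category in this dimension and its count.
        value, count = Counter(row[dim] for row in cluster).most_common(1)[0]
        if count / n >= threshold:
            result.append((dim, value))
    return result


# Hypothetical cluster of four objects over three categorical attributes.
cluster = [
    ("a", "x", "p"),
    ("a", "y", "p"),
    ("a", "z", "p"),
    ("a", "x", "q"),
]
dims = relevant_dimensions(cluster, threshold=0.75)
```

With this toy input, the first and third attributes are nearly constant within the cluster and are reported as its subspace, while the second attribute varies freely and is treated as irrelevant.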

Keywords

Clustering Algorithms, Digital Data Acquisition, SUBspace Clustering for high Dimensional Categorical Data, SUBCAD, Soybean Disease Dataset, Data Clustering Techniques, Wisconsin Breast Cancer Dataset, Mushroom Dataset, Categorical Datasets, Clustering of Categorical Data.