IUP Publications Online
The IUP Journal of Information Technology
Classification of Skewed Data: A Comparative Analysis of the Performance of Select Classifiers

Many real-world datasets exhibit significant skewness, in which instances of one class far outnumber instances of the other. For example, skewed or imbalanced class distributions are observed in domains such as financial fraud detection and network intrusion detection. The most important fact about such data is that the minority class is often of much greater significance and interest to the user than the majority class. However, a skewed class distribution makes the dataset instances harder to classify. A classifier is usually biased towards the majority class and tends to misclassify minority class instances, which poses a challenge for data mining. In this paper, we use oversampling to study the classification accuracy of some popular classifiers. The results show that the approach provides satisfactory performance on skewed datasets.

 
 

Many applications generate large volumes of data during day-to-day transactions. Organizations may archive this data to support vital decision making. Such a huge amount of data requires data mining techniques to extract potentially useful information and discover knowledge. Data mining has been used in various areas such as healthcare, business intelligence, financial trade analysis, and network intrusion detection (Klosgen and Zytkow, 2002).

In many areas, such as financial fraud detection, thousands of transactions are genuine and legitimate while only a few are fraudulent. Identifying such fraudulent transactions is both important and difficult. Similarly, in network intrusion detection, a host computer receives both normal and malicious connection requests. The malicious connection requests are very few in number, and identifying them is the primary goal. Thus, class imbalance is a prominent characteristic of most real-life data (Provost and Fawcett, 2001).

Class imbalance, or skewed data, refers to data in which instances of one class outnumber instances of the other. The class whose instances make up most of the dataset is known as the majority class, while the other class is known as the minority class. The most important fact about such datasets is that minority class instances are often of much greater significance and interest to the user.
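To make these terms concrete, the following minimal sketch identifies the majority and minority classes of a two-class label list, computes the ratio between their sizes, and then balances the classes by random oversampling (duplicating randomly chosen minority instances). The function names, the toy data, and the choice of random oversampling are assumptions made for illustration; the text here does not state which oversampling method the paper uses.

```python
import random
from collections import Counter


def class_distribution(labels):
    """Return (majority_label, minority_label, imbalance_ratio) for two-class labels."""
    counts = Counter(labels)
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    return majority, minority, n_major / n_minor


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Illustrative assumption: plain random oversampling with replacement.
    """
    rng = random.Random(seed)
    majority, minority, _ = class_distribution(labels)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    deficit = labels.count(majority) - labels.count(minority)
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return rows + extra, labels + [minority] * deficit


# Hypothetical toy data: 95 genuine and 5 fraudulent transactions.
labels = ["genuine"] * 95 + ["fraud"] * 5
rows = [[i] for i in range(100)]

majority, minority, ratio = class_distribution(labels)
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)

print(majority, minority, ratio)
print(dict(balanced_counts))
```

In practice, oversampling of this kind would normally be applied only to the training split, so that duplicated minority instances do not leak into the data used for evaluation.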

Most learning algorithms work well on balanced datasets in classification tasks. However, such algorithms perform poorly on skewed data: their accuracy on majority class instances is good, but their accuracy on minority class instances is poor. This happens because learning from imbalanced or skewed data biases the algorithms towards the majority class, so they misclassify instances of the minority class. Because the minority class holds significant interest, this misclassification problem has attracted ample research attention.
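The gap between majority and minority accuracy can be illustrated with an extreme case: a degenerate classifier that always predicts the majority class. The sketch below uses hypothetical toy labels (990 normal and 10 intrusion connection requests) and a hypothetical helper, `per_class_recall`; it is not taken from the paper's experiments. Overall accuracy looks high even though no minority instance is identified.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of each class's instances that were predicted correctly."""
    recall = {}
    for cls in set(y_true):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recall[cls] = hits / total
    return recall


# Hypothetical toy data: 990 normal and 10 malicious connection requests.
y_true = ["normal"] * 990 + ["intrusion"] * 10

# Degenerate classifier, fully biased towards the majority class.
y_pred = ["normal"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(f"overall accuracy: {accuracy:.2f}")
print(f"majority recall:  {recall['normal']:.2f}")
print(f"minority recall:  {recall['intrusion']:.2f}")
```

For this reason, results on skewed data are usually easier to interpret when accuracy is reported per class rather than as a single overall figure.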

 
 

Information Technology Journal, Skewed data, Data mining, Classifier performance, Classification of skewed data, Class imbalance, Minority class.