Many applications generate large volumes of data during day-to-day transactions.
Organizations often archive this data to support critical decision making. Such large amounts
of data call for data mining techniques to extract potentially useful information
and discover knowledge. Data mining has been applied in areas such as healthcare,
business intelligence, financial trade analysis, and network intrusion detection
(Klosgen and Zytkow, 2002).
In many domains, such as financial fraud detection, thousands of transactions
are genuine and legitimate, while only a few are fraudulent. Identifying
such fraudulent transactions is both important and difficult. Similarly,
in network intrusion detection, a host computer receives both normal and
malicious connection requests. The malicious requests
are very few in number, and identifying them is the primary goal. Thus,
class imbalance is a prominent characteristic of most real-life data
(Provost and Fawcett, 2001).
Class imbalance, or skewed data, refers to data in which the instances of one class outnumber
those of the other class. The class whose instances occupy the majority of the dataset
is known as the majority class, while the other class is known as the minority
class. The most important fact about such datasets is that the minority class instances
are often of considerable significance and interest to the user.
Most learning algorithms work well on balanced datasets in classification
tasks. However, such algorithms often perform poorly on skewed data. Their accuracy in
classifying majority instances is good, but their accuracy in classifying minority instances
is poor. This happens because learning from imbalanced or skewed data biases the
learning algorithms towards the majority class, causing them to misclassify
instances of the minority class. Because the minority class holds significant interest, this
misclassification problem has attracted considerable research attention.
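
The gap between overall accuracy and minority-class accuracy can be illustrated with a minimal sketch. The data below is synthetic and the figures (990 legitimate and 10 fraudulent transactions) are assumptions chosen only for illustration, as are the label names and helper functions. The "classifier" is the extreme case of majority bias: it always predicts the majority class.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label, i.e. what a fully majority-biased model predicts."""
    return Counter(labels).most_common(1)[0][0]


def per_class_recall(y_true, y_pred):
    """Fraction of each class's instances that were classified correctly."""
    recall = {}
    for cls in set(y_true):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recall[cls] = hits / total
    return recall


# Synthetic, hypothetical dataset: 990 majority and 10 minority instances.
y_true = ["legit"] * 990 + ["fraud"] * 10

# Always predict the majority class, regardless of the input.
predicted_label = majority_baseline(y_true)
y_pred = [predicted_label] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(f"overall accuracy: {accuracy:.2f}")
print(f"majority recall:  {recall['legit']:.2f}")
print(f"minority recall:  {recall['fraud']:.2f}")
```

Under these assumptions the overall accuracy is 0.99 even though no minority instance is identified, which is why overall accuracy alone can be misleading on skewed data.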