Computer Sciences Journal | Preprocessing Techniques for Effective Data Extraction and Computation

The IUP Journal of Computer Sciences

Preprocessing Techniques for Effective Data Extraction and Computation

Article Details

Pub. Date	:	Jul, 2013
Product Name	:	The IUP Journal of Computer Sciences
Product Type	:	Article
Product Code	:	IJCS21307
Author Name	:	M Saraswathi and V Balu
Availability	:	YES
Subject/Domain	:	Management
Download Format	:	PDF Format
No. of Pages	:	8

Price

For delivery in electronic format: Rs. 50; For delivery through courier (within India): Rs. 50 + Rs. 25 for Shipping & Handling Charges

Abstract

Information on the World Wide Web is semi-structured because of the nested structure of HTML code: much of it is linked, and much of it is redundant. Web text mining supports the overall knowledge mining process by discovering and extracting valuable information from unstructured text. Unstructured texts contain massive amounts of information, but computers cannot use them directly for further processing. This paper therefore discusses the importance of standard preprocessing methods and the steps involved in extracting the required content effectively. It proposes an effective preprocessing and dimensionality reduction technique that simplifies and speeds up computation and can improve text categorization and performance.

Description

Nowadays, most information resources on the World Wide Web are published as HTML pages, and the number of web pages is increasing rapidly as the web expands. To make better use of web information, researchers pursue technologies that can automatically reorganize and manipulate web pages, such as web information retrieval, web page classification and other web mining tasks. Web mining aims to discover useful information or knowledge from the web hyperlink structure, page content and usage logs. Useful information on the web is often accompanied by a large amount of ‘noise content’, which is the main difference between web pages and traditional texts. Noise, such as advertisements, navigation panels and copyright announcements, is usually unrelated to the topic content of the web page (Zhang et al., 2004). It makes it very difficult for programs to grasp the topic content of a page, which degrades the quality of web applications. The process of removing noisy content is called data cleaning. Preprocessing operations for data cleaning include stop-word removal, stemming and removal of double negatives.
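
As an illustration of the data cleaning steps named above, the following is a minimal sketch and not the authors' method. It assumes that noise can be approximated by dropping text inside a fixed set of HTML tags, and it uses a tiny hand-written stop-word list and a crude suffix-stripping stemmer in place of a full algorithm such as Porter's. The names NOISE_TAGS, STOP_WORDS, SUFFIXES, TextExtractor, crude_stem and preprocess are hypothetical, chosen only for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical choices for illustration: tags treated as noise, a tiny
# stop-word list, and a few suffixes for a crude stemmer (not Porter's).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "on", "the", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of noise tags currently open
        self.chunks = []  # text fragments found outside noise tags

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Remove noise tags, tokenize, drop stop words, then stem."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


page = (
    "<html><body><nav>Home | Products</nav>"
    "<p>The crawler is parsing linked documents.</p>"
    "<footer>Copyright 2013</footer></body></html>"
)
print(preprocess(page))  # ['crawler', 'pars', 'link', 'document']
```

The navigation panel and copyright notice are discarded before tokenization, so only topic content reaches the stop-word and stemming steps. A tag-based filter is a simplification: real pages often place advertisements and navigation in generic containers, which would call for a more robust noise detection approach.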

Keywords

Computer Sciences Journal, Web Text Mining, Knowledge mining, Preprocessing, Dimensionality reduction, Text clustering.