IUP Publications Online
The IUP Journal of Computer Sciences
Preprocessing Techniques for Effective Data Extraction and Computation

World Wide Web information is semi-structured because of the nested structure of HTML code: much of it is linked, and much of it is redundant. Web text mining supports the knowledge mining process by discovering and extracting valuable information from unstructured text. Unstructured texts contain massive amounts of information but cannot be used directly for further processing by computers. This paper therefore discusses the importance of standard preprocessing methods and the steps involved in extracting the required content effectively. It proposes an effective preprocessing and dimensionality reduction technique that helps simplify or speed up computation and can improve text categorization performance.

Most information resources on the World Wide Web are now published as HTML pages, and the number of web pages is growing rapidly as the web expands. To make better use of web information, technologies that can automatically reorganize and manipulate web pages are being pursued, such as web information retrieval, web page classification, and other web mining tasks. Web mining aims to discover useful information or knowledge from the web hyperlink structure, page content, and usage logs. Useful information on the web is often accompanied by a large amount of 'noise content', which is the main difference between web pages and traditional texts. Noise, such as advertisements, navigation panels, and copyright announcements, is usually unrelated to the topic content of the web page (Zhang et al., 2004). It makes it very difficult for programs to grasp the topic content of a page, which degrades the quality of web applications. The process of removing noisy content is called data cleaning. Preprocessing operations for data cleaning include stop-word removal, stemming, and removal of double negatives.
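
The cleaning steps named above can be illustrated with a minimal sketch. It is not the technique proposed in the paper: the set of tags treated as noise, the short stop-word list, and the suffix-stripping rule in `naive_stem` are all assumptions chosen for illustration, and a real system would use a full stop-word list and a proper stemmer such as Porter's.

```python
import re
from html.parser import HTMLParser

# Tags whose content is treated as noise in this sketch (an assumption).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside at least one noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.chunks.append(data)


def naive_stem(word):
    """Strip a few common English suffixes (not a full Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least four characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Return stemmed, stop-word-free tokens from the non-noise page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("<nav>Home About</nav><p>Classifying the pages of the web</p>")` drops the navigation panel and the stop words and returns `["classify", "page", "web"]`. Removal of double negatives is not covered here, since the text does not say how it is performed.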

Computer Sciences Journal, Web Text Mining, Knowledge mining, Preprocessing, Dimensionality reduction, Text clustering.