Nowadays, most of the information resources on the World Wide Web are published as
HTML pages, and the number of web pages is increasing rapidly with the expansion of
the web. To make better use of this information, researchers have pursued technologies
that can automatically reorganize and manipulate web pages, such as web information
retrieval, web page classification, and other web mining tasks. Web mining
aims to discover useful information or knowledge from the web hyperlink structure,
page content, and usage logs. Useful information on the web is often accompanied by a
large amount of ‘noise content’, which is the main difference between web pages and
traditional texts. Noise, such as advertisements, navigation panels, and copyright
announcements, is usually unrelated to the topic content of the web page (Zhang et al.,
2004). Such noise makes it difficult for programs to identify the topic content of a page,
which degrades the quality of web applications. The process of removing noisy content is called
data cleaning. Preprocessing operations for data cleaning include stop-word removal,
stemming, and removal of double negatives.
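
To make two of these preprocessing operations concrete, the following is a minimal Python sketch of stop-word removal and stemming. The stop-word list, the suffix list, and the function names (tokenize, remove_stop_words, naive_stem, preprocess) are illustrative assumptions rather than part of any method described here; in particular, the stemmer is a naive suffix stripper standing in for an established algorithm such as Porter's.

```python
import re

# Illustrative stop-word list (an assumption; practical lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "on", "the", "to"}

# Suffixes stripped by the naive stemmer (an assumption, not the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]
```

For example, under these assumptions, preprocess("The pages are linked to the advertisements") returns ["page", "link", "advertisement"]. A naive stripper of this kind will mis-stem many words, so a real pipeline would likely substitute a tested stemmer and a fuller stop-word list.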