IUP Publications Online
The IUP Journal of Information Technology
Web Content Mining: Issues and Challenges

Mining the World Wide Web (WWW) is a challenging and laborious task, largely because commercial search engines often retrieve information that is irrelevant to the user's query. The number of documents these engines retrieve is enormous, and the underlying document repository grows daily. As a result, finding relevant material becomes difficult and costly in time for a web surfer. The content obtained is often repetitive, unrelated, or simply useless. Many commercial search engines exist today, and each has its own pros and cons. Each system also has its own search procedure, and these procedures are being analyzed by several researchers. This paper details the hierarchy of web mining and, on that basis, analyzes the challenges and future directions for an efficient web search process.

 
 

The World Wide Web (WWW), often loosely referred to as the Internet, is a vast store of potentially useful material, ranging from millions to trillions of webpages and other media such as text, audio and video, indexed and stored in the form of databases. Storing this information as a database is itself a challenging task. Commercial search engines scour these databases, based on the queries the user provides through the user interface, for pattern taxonomy extraction (Sheng-Tang et al., 2004). A report by WorldWideWebsize.com says that the Indexed Web contains at least 12.73 billion pages. Search has become the default gateway to the web. In spite of the availability of such huge amounts of data, retrieving information that matches the user's target is still an open challenge. It is also reported that approximately 10-15% of the information available on the web is spam (the act of inserting specific information repeatedly to boost the original value of the corpora).

Consider the statistics posted on the web by Royal Pingdom (http://royal.pingdom.com/2011/01/12/internet-2010-in-numbers/), which show the scale and growth of Internet usage in its different forms during the year 2010. The statistics on domain names are as follows: 88.8 million (.COM), 13.2 million (.NET), 8.6 million (.ORG), and 202 million across all top-level domains. There were 1.97 billion Internet users worldwide, a 14% increase over the previous year (2009): 825.1 million users in Asia, 475.1 million in Europe, 266.2 million in North America, 204.7 million in Latin America/Caribbean, 110.9 million in Africa, 63.2 million in the Middle East and 21.3 million in Oceania/Australia. The survey statistics clearly show that Asia has more Internet users than any other region.
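The regional comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the user counts quoted in the text; the names `users_millions` and `regional_shares` are illustrative and do not come from the source.

```python
# Regional Internet user counts for 2010, in millions, as quoted in the text.
users_millions = {
    "Asia": 825.1,
    "Europe": 475.1,
    "North America": 266.2,
    "Latin America/Caribbean": 204.7,
    "Africa": 110.9,
    "Middle East": 63.2,
    "Oceania/Australia": 21.3,
}


def regional_shares(counts):
    """Return each region's percentage share of the total, to one decimal."""
    total = sum(counts.values())
    return {region: round(100 * n / total, 1) for region, n in counts.items()}


total_millions = sum(users_millions.values())
shares = regional_shares(users_millions)

# Print the regions from largest to smallest share.
for region, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<26}{users_millions[region]:>8.1f} M  {pct:>5.1f}%")
print(f"{'Total':<26}{total_millions:>8.1f} M")
```

The regional figures sum to about 1,966.5 million, which is consistent with the quoted worldwide total of 1.97 billion, and Asia accounts for roughly 42% of all users.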

The impact of social media, as observed in the same statistics, is as follows: 152 million blogs on the Internet; 25 billion tweets on Twitter, with 100 million accounts added on Twitter in 2010; 175 million people on Twitter; 7.7 million fans of Lady Gaga, Twitter's most followed user; 600 million people on Facebook; 250 million new users on Facebook; 30 billion pieces of content (links, notes, photos, etc.) shared on Facebook per month; a 70% share of Facebook users from outside the US; and 20 million Facebook apps installed each day.

To date, a number of interesting studies specific to individual languages, especially non-English languages such as Arabic, French, Greek, Hebrew, Slovenian, Swedish and Turkish, have been carried out (Judit and Tatyana, 2005). These studies in cross-language information retrieval can serve as starting points for improving the search results of non-English queries on the web. The Cross-Language Evaluation Forum (CLEF) (http://clef-campaign.org/) promotes R&D in multilingual information access by developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating in European languages, in both monolingual and cross-language contexts. Our research focus is on the English language only, but we would attempt to implement our algorithms for several other languages to test domain independence.

 
 

Keywords: Information Technology Journal, Web mining, Semantic web, Hyperlink, Search engine, Optimization, Internet.