The World Wide Web (WWW), often loosely referred to as the Internet, is a vast store of
potentially useful material, ranging from millions to trillions of webpages and other
forms of media such as text, audio and video, indexed and stored in the form of
databases. However, storing this information in databases is itself a challenging task.
Commercial search engines scour these databases based on the queries the user provides
through the user interface for pattern taxonomy extraction (Sheng-Tang et al., 2004).
The report by WorldWideWebsize.com says that the Indexed Web contains at least
12.73 billion pages. Search has become the default gateway to the web. In spite of the
availability of this huge volume of data, retrieving information that matches the
user’s target is still an open challenge. It is also reported that approximately 10-15% of
the information available on the web is spam (the act of inserting specific information
repeatedly to boost the apparent value of the corpora).
Let us take a quick look at the statistics for the year 2010 published by Royal
Pingdom (http://royal.pingdom.com/2011/01/12/internet-2010-in-numbers/), which
show the growth of Internet usage in its different forms. The statistics on
registered domain names are as follows: 88.8 million (.COM), 13.2 million (.NET),
8.6 million (.ORG) and 202 million across all top-level domains. There were
1.97 billion Internet users worldwide, a 14% increase over the previous year (2009):
825.1 million users in Asia, 475.1 million in Europe, 266.2 million in North America,
204.7 million in Latin America/Caribbean, 110.9 million in Africa, 63.2 million in
the Middle East and 21.3 million in Oceania/Australia. The survey statistics clearly
show that Asia has more Internet users than any other region.
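The regional comparison can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the user counts quoted above. The names `users_millions` and `regional_shares` are illustrative assumptions and do not come from the Royal Pingdom report.

```python
# Regional Internet user counts for 2010, in millions, as quoted in the text.
users_millions = {
    "Asia": 825.1,
    "Europe": 475.1,
    "North America": 266.2,
    "Latin America/Caribbean": 204.7,
    "Africa": 110.9,
    "Middle East": 63.2,
    "Oceania/Australia": 21.3,
}


def regional_shares(counts):
    """Return each region's percentage share of the summed total."""
    total = sum(counts.values())
    return {region: 100.0 * n / total for region, n in counts.items()}


total_millions = sum(users_millions.values())
shares = regional_shares(users_millions)

# Print the regions from largest to smallest share.
print(f"Total users: {total_millions:.1f} million")
for region, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<26}{pct:5.1f}%")
```

The regional figures sum to about 1,966.5 million, which agrees with the quoted worldwide total of 1.97 billion, and Asia accounts for roughly 42% of all users.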
The impact of social media, as observed, is: 152 million blogs on the Internet;
25 billion tweets on Twitter, with 100 million accounts added on Twitter in 2010;
175 million people on Twitter; 7.7 million fans of Lady Gaga, Twitter’s most followed
user; 600 million people on Facebook; 250 million new users on Facebook; 30 billion
pieces of content such as links, notes and photos shared on Facebook per month; 70%
share of Facebook users from outside the US; and 20 million Facebook apps installed
each day.
To date, a number of interesting studies specific to individual languages, especially
non-English languages such as Arabic, French, Greek, Hebrew, Slovenian, Swedish and
Turkish, have been carried out (Judit and Tatyana, 2005). The studies carried out in
cross-language information retrieval can serve as starting points for improving the
search results of non-English queries on the web. The Cross-Language Evaluation
Forum (CLEF) (http://clef-campaign.org/) promotes R&D in multilingual information
access by developing an infrastructure for the testing, tuning and evaluation of
information retrieval systems operating in European languages in both monolingual
and cross-language contexts. Our research focuses on the English language, although
we will attempt to apply our algorithms to several other languages to test
domain independence.