The Internet has become an integral part of our lives. Information is retrieved from enormous amounts of data that are often represented as text. The search process is not limited to text alone; however, we are not concerned with other formats of data here. An important function of information processing is to facilitate search over a textual corpus, which essentially means identifying the knowledge base relevant to a query. Specifically, when a user issues a query, the system retrieves a set of text documents deemed relevant to that query, typically with a rank associated with each returned document.
Analysis shows that the returned web results depend on how frequently the query terms occur in a document. Hence, spammers try to insert such frequent words into their documents to make them appear worthwhile to read. The web surfer unknowingly reads the entire content and finally ends up with material that is repetitive or irrelevant to what was expected. Such documents can be condensed into short tags, and it is worth analyzing the obtained tags against the characteristics of the query; this is referred to as 'text snippet extraction' in this paper.
Snippets are short fragments of text extracted from the document content or its metadata. They may be static (for example, always showing the first 50 words of the document, the content of its description metadata, or a description taken from a directory site such as dmoz.org) or query biased. A query-biased snippet is one that is selectively extracted on the basis of its relation to the searcher's query (Andrew Turpin et al., 2007).
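As a concrete illustration of the query-biased case, the following is a minimal Python sketch, not the method of any work cited here, that scores fixed-size word windows of a document by how many distinct query terms they cover; the window size and all names are illustrative assumptions.

```python
import re

def query_biased_snippet(document: str, query: str, window: int = 30) -> str:
    """Return the window-word span of the document that covers the most
    distinct query terms (a simple stand-in for query-biased extraction)."""
    words = re.findall(r"\w+", document.lower())
    query_terms = set(re.findall(r"\w+", query.lower()))
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        score = len(query_terms.intersection(span))  # distinct query terms covered
        if score > best_score:
            best_start, best_score = start, score
    return " ".join(words[best_start:best_start + window])
```

A static snippet, by contrast, would simply return the first `window` words of the document regardless of the query.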
Various other forms of snippet identification involve pseudo-relevance feedback (Youngjoong et al., 2007), statistical language models (Qing Li et al., 2010), probabilistic information extraction (Daisy et al., 2010) and Hidden Markov Models (HMM) (Xinxin Wang and Selcuk Candan, 2010). Information retrieval on both unstructured and structured data is widely used for snippet generation, adopting techniques such as extraction (Zhenhu et al., 2010), natural language generation (Hirschman and Gaizauskas, 2001), semantics (Iraklis and Sofia, 2009; Sven Hartrumpf, 2006; Xi Bai et al., 2008) and Question Answering (QA) (Andrew Hickl et al., 2007; Adrian Iftene, 2009). To improve the user search experience, various ranking schemes have been proposed so that users can focus on the results deemed highly relevant (Yu Huang, 2008). Fostered by diverse evaluation forums such as TREC, CLEF, and NTCIR, there are important efforts to extend the functionality of existing search engines. The presence of QA systems on the web is still small compared with traditional search engines; the reason is that QA technology, in contrast to traditional IR methods, is not equally mature for all languages (Alberto et al., 2010).
Extracting a query-relevant snippet or passage and highlighting the relevant information in a long document can help reduce the result-navigation cost for end users. While the traditional approach of highlighting matching keywords helps when the search is keyword-oriented, finding appropriate snippets to represent matches to more complex queries requires novel techniques that can succinctly characterize the relevance of various parts of a document to the given query. In this paper, we present a language model-based method for accurately detecting the most relevant passages of a given document. In OASIS, a system that helps reduce the navigational load of blind users accessing web-based digital libraries, the authors designed a query-informed segmentation for snippet extraction (Qing Li et al., 2007).
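To make the language-model idea concrete, the following is a minimal sketch of query-likelihood passage scoring with Dirichlet smoothing, a standard formulation rather than the exact method of this paper; the smoothing parameter mu, the pseudo-count for unseen terms, and all names are illustrative assumptions.

```python
import math
from collections import Counter

def passage_score(query_terms, passage_terms, collection_tf, collection_len, mu=2000):
    """Query likelihood of a passage under Dirichlet smoothing:
    log P(q | p) = sum over query terms t of
    log((tf(t, p) + mu * P(t | C)) / (|p| + mu))."""
    tf = Counter(passage_terms)
    score = 0.0
    for t in query_terms:
        # Background model, with a small pseudo-count for terms unseen in the collection.
        p_coll = collection_tf.get(t, 0.5) / collection_len
        score += math.log((tf[t] + mu * p_coll) / (len(passage_terms) + mu))
    return score

def most_relevant_passage(query_terms, passages, collection_tf, collection_len):
    """Return the candidate passage (a list of terms) with the highest score."""
    return max(passages,
               key=lambda p: passage_score(query_terms, p, collection_tf, collection_len))
```

Ranking candidate passages this way lets the highest-scoring segment serve directly as the snippet for the document.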
The performance of a web QA system (QAS) is wholly dependent upon the rank and number of relevant documents fetched by the search engine (Alejandro et al., 2008). To answer a question, a system must analyze the question, perhaps in the context of some ongoing interaction; it must find one or more answers by consulting online resources; and it must present the answer to the user in some appropriate form, perhaps together with a justification or supporting materials. The QA track evaluates systems that answer factual questions by consulting the documents of the TREC corpus. A number of systems in this evaluation have successfully combined information retrieval and natural language processing techniques. Evaluation using reading comprehension tests provides a different approach to QA, based on a system's ability to answer questions about a specific reading passage (Hirschman and Gaizauskas, 2001).
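As a schematic view of this three-stage process (analyze, retrieve, present), the following is a toy Python sketch under heavily simplified assumptions; the question analysis and retrieval stages are naive term-overlap stand-ins, not any cited system, and all names are illustrative.

```python
def analyze_question(question: str) -> list[str]:
    """Stage 1: reduce the question to content terms (a naive stand-in
    for full question analysis and answer-type detection)."""
    stop = {"who", "what", "when", "where", "is", "are", "the", "a", "of"}
    return [w.strip("?.,").lower() for w in question.split()
            if w.strip("?.,").lower() not in stop]

def retrieve(terms: list[str], corpus: list[str]) -> str:
    """Stage 2: pick the corpus document with the most query-term matches
    (a stand-in for the underlying search engine)."""
    return max(corpus, key=lambda doc: sum(t in doc.lower() for t in terms))

def answer(question: str, corpus: list[str]) -> str:
    """Stage 3: present the best-matching sentence of the top document
    as the answer; justification is left to the surrounding context."""
    terms = analyze_question(question)
    sentences = retrieve(terms, corpus).split(". ")
    return max(sentences, key=lambda s: sum(t in s.lower() for t in terms))
```

Real QA systems replace each stage with far richer components, but the division of labor remains the same.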