The Internet has become an integral part of our lives. Information is retrieved from enormous amounts of data that are often represented as text. The search process is not limited to text alone; however, we are not concerned with other formats of data here. An important function of information processing is to facilitate search over a textual corpus, which essentially means identifying the knowledge base relevant to a query. Specifically, when a user issues a query, the system retrieves a set of text documents deemed relevant to that query, typically with a rank associated with each returned document.
Analysis shows that the returned web results depend on how frequently the query terms occur in a document. Hence, spammers try to insert such frequent words into their documents to make them appear worthwhile to read. The web surfer unknowingly reads the entire content and finally ends up with material that is repetitive or irrelevant to what was expected. Such documents can be condensed into short tags, and it is worth analyzing the obtained tags against the characteristics of the query; this is referred to as 'text snippet extraction' in this paper.
Snippets are short fragments of text extracted from the document content or its metadata. They may be static (for example, always showing the first 50 words of the document, the content of its description metadata, or a description taken from a directory site such as dmoz.org) or query biased. A query-biased snippet is one that is selectively extracted on the basis of its relation to the searcher's query (Andrew Turpin et al., 2007).
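As a concrete illustration of the query-biased case, the following is a minimal Python sketch, not the method of any work cited here, that scores fixed-size word windows of a document by how many distinct query terms they cover; the window size and all names are illustrative assumptions.

```python
import re

def query_biased_snippet(document: str, query: str, window: int = 30) -> str:
    """Return the window-word span of the document that covers the most
    distinct query terms (a simple stand-in for query-biased extraction)."""
    words = re.findall(r"\w+", document.lower())
    query_terms = set(re.findall(r"\w+", query.lower()))
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        score = len(query_terms.intersection(span))  # distinct query terms covered
        if score > best_score:
            best_start, best_score = start, score
    return " ".join(words[best_start:best_start + window])
```

A static snippet, by contrast, would simply return the first `window` words of the document regardless of the query.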
Various other forms of snippet identification involve pseudo-relevance feedback (Youngjoong et al., 2007), statistical language models (Qing Li et al., 2010), probabilistic information extraction (Daisy et al., 2010) and Hidden Markov Models (HMM) (Xinxin Wang and Selcuk Candan, 2010). Information retrieval on both unstructured and structured data is widely used for snippet generation, adopting techniques such as extraction (Zhenhu et al., 2010), natural language generation (Hirschman and Gaizauskas, 2001), semantics (Iraklis and Sofia, 2009; Sven Hartrumpf, 2006; Xi Bai et al., 2008) and Question Answering (QA) (Andrew Hickl et al., 2007; Adrian Iftene, 2009). To improve the user search experience, various ranking schemes have been proposed so that users can focus on the results deemed highly relevant (Yu Huang, 2008). Fostered by diverse evaluation forums such as TREC, CLEF, and NTCIR, there are important efforts to extend the functionality of existing search engines. The presence of QA systems on the web is still small compared with traditional search engines; the reason is that QA technology, in contrast to traditional IR methods, is not equally mature for all languages (Alberto et al., 2010).
Extracting a query-relevant snippet or passage and highlighting the relevant information in a long document can help reduce the result-navigation cost for end users. While the traditional approach of highlighting matching keywords helps when the search is keyword-oriented, finding appropriate snippets to represent matches to more complex queries requires novel techniques that can succinctly characterize the relevance of various parts of a document to the given query. In this paper, we present a language model-based method for accurately detecting the most relevant passages of a given document. In OASIS, a system that helps reduce the navigational load of blind users accessing web-based digital libraries, the authors designed a query-informed segmentation for snippet extraction (Qing Li et al., 2007).
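To make the language-model idea concrete, the following is a minimal sketch of query-likelihood passage scoring with Dirichlet smoothing, a standard formulation rather than the exact method of this paper; the smoothing parameter mu, the pseudo-count for unseen terms, and all names are illustrative assumptions.

```python
import math
from collections import Counter

def passage_score(query_terms, passage_terms, collection_tf, collection_len, mu=2000):
    """Query likelihood of a passage under Dirichlet smoothing:
    log P(q | p) = sum over query terms t of
    log((tf(t, p) + mu * P(t | C)) / (|p| + mu))."""
    tf = Counter(passage_terms)
    score = 0.0
    for t in query_terms:
        # Background model, with a small pseudo-count for terms unseen in the collection.
        p_coll = collection_tf.get(t, 0.5) / collection_len
        score += math.log((tf[t] + mu * p_coll) / (len(passage_terms) + mu))
    return score

def most_relevant_passage(query_terms, passages, collection_tf, collection_len):
    """Return the candidate passage (a list of terms) with the highest score."""
    return max(passages,
               key=lambda p: passage_score(query_terms, p, collection_tf, collection_len))
```

Ranking candidate passages this way lets the highest-scoring segment serve directly as the snippet for the document.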
The performance of a web QA system (QAS) is wholly dependent upon the rank and number of relevant documents fetched by the search engine (Alejandro et al., 2008). To answer a question, a system must analyze the question, perhaps in the context of some ongoing interaction; it must find one or more answers by consulting online resources; and it must present the answer to the user in some appropriate form, perhaps together with a justification or supporting materials. The QA track evaluates systems that answer factual questions by consulting the documents of the TREC corpus. A number of systems in this evaluation have successfully combined information retrieval and natural language processing techniques. Evaluation using reading comprehension tests provides a different approach to QA, based on a system's ability to answer questions about a specific reading passage (Hirschman and Gaizauskas, 2001).
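As a schematic view of this three-stage process (analyze, retrieve, present), the following is a toy Python sketch under heavily simplified assumptions; the question analysis and retrieval stages are naive term-overlap stand-ins, not any cited system, and all names are illustrative.

```python
def analyze_question(question: str) -> list[str]:
    """Stage 1: reduce the question to content terms (a naive stand-in
    for full question analysis and answer-type detection)."""
    stop = {"who", "what", "when", "where", "is", "are", "the", "a", "of"}
    return [w.strip("?.,").lower() for w in question.split()
            if w.strip("?.,").lower() not in stop]

def retrieve(terms: list[str], corpus: list[str]) -> str:
    """Stage 2: pick the corpus document with the most query-term matches
    (a stand-in for the underlying search engine)."""
    return max(corpus, key=lambda doc: sum(t in doc.lower() for t in terms))

def answer(question: str, corpus: list[str]) -> str:
    """Stage 3: present the best-matching sentence of the top document
    as the answer; justification is left to the surrounding context."""
    terms = analyze_question(question)
    sentences = retrieve(terms, corpus).split(". ")
    return max(sentences, key=lambda s: sum(t in s.lower() for t in terms))
```

Real QA systems replace each stage with far richer components, but the division of labor remains the same.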