Automatic text summarization is an important and challenging area of natural
language processing (NLP). Research on automatic summarization, covering both extraction and
abstraction, has a long history, with an early burst of effort in the 1960s following some pioneering work
[1, 2]. The task of a text summarizer is to produce a synopsis of any document or a set
of documents submitted. A summary can be of a single document or multiple documents,
generic (author's perspective) or query-oriented (user specific) [3], indicative (using
keywords indicating the central topics) or informative (content-oriented) [4]. A summary can be
an extract, in which certain portions (sentences or phrases) of the text are reproduced,
or an abstract, whose production involves breaking the text down into a number of different
key ideas, fusing specific ideas into more general ones, and then generating new
sentences that express these general ideas. Thus a summarization system falls into at least one,
and often more than one, slot in each of the main categories above, and must therefore be
evaluated along several dimensions using different measures [5]. In our work, we focus on
generic, extractive summaries and evaluate the results against user-generated target summaries.
In a multi-document summarization system, the main task is to merge the documents
or a subset of summaries, a process that identifies pairs of sentences with similar
content. The organization of information for multi-document summarization
has received relatively little attention. While sentence ordering for single-document
summarization can be determined from the ordering of sentences in the input article, this
is not the case for multi-document summarization, where summary sentences may be
drawn from different input articles.
In this paper, we propose a methodology for merging information in text documents.
The process of merging is challenging: it should recognize when two
sentences carry the same content, so that this information appears in the resulting summary
only once; it should also recognize when the information in one sentence is a subset of
the information in another (that is, everything in one sentence is already available in the other).
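
To make the two merging requirements concrete, the following minimal sketch treats each sentence as a set of content words and compares those sets. It is an illustration only and not the methodology proposed in this paper: plain word overlap stands in for a real content-similarity measure, and the function names (content_words, relation, merge), the small stopword list, and the 0.8 threshold are all hypothetical choices made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "is", "are", "to", "and"})


def content_words(sentence):
    """Lowercase the sentence and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def relation(s1, s2, threshold=0.8):
    """Classify how the content of s1 relates to the content of s2.

    Returns "same", "first_subsumed" (s1 is contained in s2),
    "second_subsumed" (s2 is contained in s1), or "distinct".
    The threshold is an assumed value, not taken from the paper.
    """
    a, b = content_words(s1), content_words(s2)
    if not a or not b:
        return "distinct"
    if a == b:
        return "same"
    if a < b:  # proper subset: s1 adds nothing beyond s2
        return "first_subsumed"
    if b < a:  # proper subset: s2 adds nothing beyond s1
        return "second_subsumed"
    jaccard = len(a & b) / len(a | b)
    return "same" if jaccard >= threshold else "distinct"


def merge(sentences, threshold=0.8):
    """Keep each piece of information once, preferring the fuller sentence."""
    kept = []
    for s in sentences:
        rels = [relation(s, k, threshold) for k in kept]
        # Skip s if an already kept sentence says the same or more.
        if any(r in ("same", "first_subsumed") for r in rels):
            continue
        # Drop kept sentences whose content s fully contains, then keep s.
        kept = [k for k, r in zip(kept, rels) if r != "second_subsumed"]
        kept.append(s)
    return kept
```

Under these assumptions, a sentence that repeats an earlier one is dropped, and a sentence that contains everything in an earlier one replaces it. Word overlap cannot detect paraphrases that use different vocabulary, so a practical system would need a richer similarity measure at the point where relation is computed.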