Semantic similarity techniques determine how similar two concepts or terms are according to a given knowledge base. This paper proposes a method for measuring the semantic similarity/distance between terms. The proposed measure not only combines the strengths but also compensates for the weaknesses of existing measures that use a knowledge base as their primary source. It employs a new feature, common specificity (CommSpec), in addition to the path length feature. The CommSpec feature is derived from (1) the information content of the concepts; and (2) the information content of the knowledge base given a corpus. We evaluated the proposed measure on a benchmark test set of term pairs scored for similarity by human experts. The experimental results demonstrate that the proposed similarity measure is effective and outperforms the existing measures, giving the best correlation (0.874) with human scores.
The need to determine the degree of semantic similarity
between two lexically expressed concepts is a problem that
pervades much of computational linguistics. Measures of similarity or relatedness are widely used in NLP (Natural Language Processing) applications such as word sense disambiguation, example-based machine translation, discourse structure determination, text classification, summarization and annotation, information extraction and retrieval, automatic indexing, and lexical selection (Rubenstein and Goodenough, 1965).
There are many methods for computing word similarity (or relatedness). Generally speaking, they fall into two basic classes: those based on an ontology or semantic taxonomy, and those based on collocations of words in a corpus. WordNet (Abney and Light, 1999; Lin, 1998) is particularly well suited to English word similarity, and many researchers have proposed similarity measures based on it. These measures range from simple edge counting to attempts to factor in peculiarities of the network structure by considering link direction (hso; Hirst and St-Onge, 1998), relative depth or path length (wup, lch; Wu and Palmer, 1994; Leacock and Chodorow, 1998), and density (Agirre and Rigau, 1996); a random baseline that simply returns random numbers (random) is often included for comparison. These analytic methods now face competition from statistical and machine learning techniques, but a number of hybrid approaches have been proposed that combine a knowledge-rich source, such as a thesaurus, with a knowledge-poor source, such as corpus statistics (res, lin, jcn) (Jiang and Conrath, 1997; Leacock and Chodorow, 1998; Miller and Charles, 1991; Resnik, 1999). Banerjee and Pedersen (2003) introduced the Adapted Lesk measure (lesk), while Richardson et al. (1994), Pedersen et al., and Patwardhan suggested the context vector measure (vector). Resnik (1999) and Rubenstein and Goodenough (1965) studied the measurement of similarity or relatedness of English words.
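As a concrete illustration of how these existing measures behave, the short Python sketch below computes several of them (path, wup, lch, res, lin, jcn) for a sample pair of WordNet concept nodes using the NLTK library. This only demonstrates the prior measures surveyed above, not the measure proposed in this paper; the sample pair is chosen arbitrarily.

```python
# Existing WordNet-based similarity measures on one sample pair, via NLTK.
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('wordnet_ic', quiet=True)

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

car, bus = wn.synset('car.n.01'), wn.synset('bus.n.01')
brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content from the Brown corpus

print('path:', car.path_similarity(bus))           # 1 / (shortest path edges + 1)
print('wup: ', car.wup_similarity(bus))            # relative depth of the common subsumer
print('lch: ', car.lch_similarity(bus))            # -log(path / (2 * taxonomy depth))
print('res: ', car.res_similarity(bus, brown_ic))  # IC of the lowest common subsumer
print('lin: ', car.lin_similarity(bus, brown_ic))  # Resnik's IC, normalized by both concepts
print('jcn: ', car.jcn_similarity(bus, brown_ic))  # inverse of an IC-based distance
```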
This paper explores the existing semantic similarity measures that use an ontology as the primary information source, and proposes a new ontology-based measure. The proposed measure combines ontology-structure and corpus-based features that have great potential for measuring semantic similarity. Moreover, it uses a new feature, Common Specificity (CommSpec), in addition to the path length feature. The proposed measure was evaluated on a benchmark test set of term pairs scored for similarity by human experts. The experimental results show that the technique is effective, producing the best correlation with human scores on the benchmark test set compared with the existing measures. Throughout this paper, the term `concept node' denotes a concept class represented as a node in the ontology, which contains a set of synonymous concepts. The similarity of two concepts belonging to the same node (i.e., synonymous concepts) is maximal, and the similarity of any two concepts is taken as the similarity of the concept nodes containing them.
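To make this idea concrete before the formal presentation, the sketch below shows one way a path length feature could be combined with a specificity term derived from the information content of the lowest common subsumer. The functional form, the 1/(1 + IC) transformation, and the parameters alpha, beta, and k are illustrative assumptions, not the paper's actual definition of the CommSpec-based measure.

```python
# Illustrative sketch (NOT the paper's formula): a semantic distance that
# grows with the path length between two concept nodes and shrinks as their
# lowest common subsumer (LCS) becomes more specific under a corpus-based
# information content model. alpha, beta, and k are hypothetical parameters.
import math
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic('ic-brown.dat')  # corpus-based information content

def sem_distance(s1, s2, ic_table=brown_ic, alpha=0.5, beta=0.5, k=1.0):
    path = s1.shortest_path_distance(s2)         # edge count between the nodes
    lcs = s1.lowest_common_hypernyms(s2)[0]      # most specific shared ancestor
    ic_lcs = information_content(lcs, ic_table)  # -log p(LCS) given the corpus
    spec_gap = 1.0 / (1.0 + ic_lcs)              # small gap when the LCS is specific
    return math.log2((path ** alpha) * (spec_gap ** beta) + k)

# Synonymous concepts share a node, so path = 0 and the distance is log2(k) = 0.
print(sem_distance(wn.synset('car.n.01'), wn.synset('automobile.n.01')))
print(sem_distance(wn.synset('car.n.01'), wn.synset('bus.n.01')))
```

Under these assumptions the measure behaves as the intuition requires: synonyms (two concepts in the same node) are at distance zero, and distance increases with longer paths and less specific shared ancestors.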