Semantic similarity techniques determine how similar two concepts or terms are according to a given knowledge base. This paper proposes a method for measuring the semantic similarity/distance between terms. The proposed measure not only combines the strengths but also compensates for the weaknesses of existing measures that use a knowledge base as their primary source. It employs a new feature, common specificity (CommSpec), in addition to the path length feature. The CommSpec feature is derived from (1) the information content of the concepts; and (2) the information content of the knowledge base given a corpus. We evaluated the proposed measure on a benchmark test set of term pairs scored for similarity by human experts. The experimental results demonstrate that the proposed similarity measure is effective and outperforms the existing measures, giving the best correlation (0.874) with human scores.
The need to determine the degree of semantic similarity
between two lexically expressed concepts is a problem that
pervades much of computational linguistics. Measures of similarity or relatedness are widely used in NLP (Natural Language Processing) applications such as word sense disambiguation, example-based machine translation, discourse structure determination, text classification, summarization and annotation, information extraction and retrieval, automatic indexing, and lexical selection (Rubenstein and Goodenough, 1965).
There are many methods for computing word similarity (or relatedness). Generally speaking, they fall into two basic classes: those based on an ontology or semantic taxonomy, and those based on collocations of words in a corpus. WordNet (Abney and Light, 1999; Lin, 1998) is particularly well suited to English word similarity, and many researchers have proposed similarity measures based on it. These measures range from simple edge counting to attempts to factor in peculiarities of the network structure by considering link direction (hso; Hirst and St-Onge, 1998), relative depth or path length (wup, lch; Wu and Palmer, 1994; Leacock and Chodorow, 1998), and density (Agirre and Rigau, 1996); a random baseline that simply returns random numbers (random) is often included for comparison. These analytic methods now face competition from statistical and machine learning techniques, but a number of hybrid approaches have been proposed that combine a knowledge-rich source, such as a thesaurus, with a knowledge-poor source, such as corpus statistics (res, lin, jcn) (Jiang and Conrath, 1997; Leacock and Chodorow, 1998; Miller and Charles, 1991; Resnik, 1999). Banerjee and Pedersen (2003) introduced the Adapted Lesk measure (lesk), while Richardson et al. (1994), Pedersen et al., and Patwardhan suggested the context vector measure (vector). Resnik (1999) and Rubenstein and Goodenough (1965) studied the measurement of similarity or relatedness of English words.
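As a concrete illustration of how these existing measures behave, the short Python sketch below computes several of them (path, wup, lch, res, lin, jcn) for a sample pair of WordNet concept nodes using the NLTK library. This only demonstrates the prior measures surveyed above, not the measure proposed in this paper; the sample pair is chosen arbitrarily.

```python
# Existing WordNet-based similarity measures on one sample pair, via NLTK.
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('wordnet_ic', quiet=True)

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

car, bus = wn.synset('car.n.01'), wn.synset('bus.n.01')
brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content from the Brown corpus

print('path:', car.path_similarity(bus))           # 1 / (shortest path edges + 1)
print('wup: ', car.wup_similarity(bus))            # relative depth of the common subsumer
print('lch: ', car.lch_similarity(bus))            # -log(path / (2 * taxonomy depth))
print('res: ', car.res_similarity(bus, brown_ic))  # IC of the lowest common subsumer
print('lin: ', car.lin_similarity(bus, brown_ic))  # Resnik's IC, normalized by both concepts
print('jcn: ', car.jcn_similarity(bus, brown_ic))  # inverse of an IC-based distance
```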
This paper explores the existing semantic similarity measures that use an ontology as the primary information source, and proposes a new ontology-based measure. The proposed measure combines ontology-structure and corpus-based features that have great potential for measuring semantic similarity. Moreover, it uses a new feature, Common Specificity (CommSpec), in addition to the path length feature. The proposed measure was evaluated on a benchmark test set of term pairs scored for similarity by human experts. The experimental results show that the technique is effective, producing the best correlation with human scores on the benchmark test set compared with the existing measures. Throughout this paper, the term `concept node' denotes a concept class represented as a node in the ontology, which contains a set of synonymous concepts. The similarity of two concepts belonging to the same node (i.e., synonymous concepts) is maximal, and the similarity of any two concepts is taken as the similarity of the concept nodes containing them.
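To make this idea concrete before the formal presentation, the sketch below shows one way a path length feature could be combined with a specificity term derived from the information content of the lowest common subsumer. The functional form, the 1/(1 + IC) transformation, and the parameters alpha, beta, and k are illustrative assumptions, not the paper's actual definition of the CommSpec-based measure.

```python
# Illustrative sketch (NOT the paper's formula): a semantic distance that
# grows with the path length between two concept nodes and shrinks as their
# lowest common subsumer (LCS) becomes more specific under a corpus-based
# information content model. alpha, beta, and k are hypothetical parameters.
import math
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic('ic-brown.dat')  # corpus-based information content

def sem_distance(s1, s2, ic_table=brown_ic, alpha=0.5, beta=0.5, k=1.0):
    path = s1.shortest_path_distance(s2)         # edge count between the nodes
    lcs = s1.lowest_common_hypernyms(s2)[0]      # most specific shared ancestor
    ic_lcs = information_content(lcs, ic_table)  # -log p(LCS) given the corpus
    spec_gap = 1.0 / (1.0 + ic_lcs)              # small gap when the LCS is specific
    return math.log2((path ** alpha) * (spec_gap ** beta) + k)

# Synonymous concepts share a node, so path = 0 and the distance is log2(k) = 0.
print(sem_distance(wn.synset('car.n.01'), wn.synset('automobile.n.01')))
print(sem_distance(wn.synset('car.n.01'), wn.synset('bus.n.01')))
```

Under these assumptions the measure behaves as the intuition requires: synonyms (two concepts in the same node) are at distance zero, and distance increases with longer paths and less specific shared ancestors.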