A Word Embeddings Approach to Predicting the Compositionality of Idiomatic Expressions
Publisher: Saudi Digital Library
Abstract
A significant part of every natural language consists of Multiword Expressions (MWEs), which must be handled appropriately in many Natural Language Processing (NLP) applications. In WordNet 1.7, one of the largest lexical databases of the English language, 41% of the records are MWEs (Fellbaum, 1998).
An idiomatic expression is a type of MWE that can be defined as a phrase whose idiosyncratic meaning cannot be derived from its component words. Sag et al. (2002) argue that the non-compositionality of idiomatic expressions poses problems for various NLP tasks, as basic grammatical rules cannot be used to identify these expressions directly.
Various NLP studies have used word embedding models as a statistical method for evaluating the compositionality of MWEs, because the vector space of these models captures word meaning: words that occur in similar contexts are represented by similar vectors. Therefore, in this project, word embedding models are used as a semantic measure for predicting the compositionality of MWEs that are semantically non-compositional (i.e., idiomatic expressions) in a corpus.
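One common way to operationalise this idea (offered here as an illustrative sketch, not the thesis's exact implementation) is to score an MWE by the cosine similarity between the vector learned for the whole phrase and the composition, e.g. the sum, of its component word vectors; a low score suggests a non-compositional, idiomatic expression. The toy vectors below are hypothetical stand-ins for trained embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compositionality_score(phrase_vec, word_vecs):
    # Compose the component word vectors by summation, then compare
    # the composition with the vector learned for the whole phrase.
    composed = np.sum(word_vecs, axis=0)
    return cosine(phrase_vec, composed)

# Hypothetical 50-dimensional embeddings (random stand-ins for
# vectors a real Word2vec model would learn from a corpus).
rng = np.random.default_rng(0)
red, car = rng.normal(size=50), rng.normal(size=50)
kick, bucket = rng.normal(size=50), rng.normal(size=50)

literal_phrase = red + car          # "red car": behaves compositionally
idiom_phrase = rng.normal(size=50)  # "kick the bucket": unrelated to its parts

print(compositionality_score(literal_phrase, [red, car]))    # ~1.0 (compositional)
print(compositionality_score(idiom_phrase, [kick, bucket]))  # near 0 (idiomatic)
```

In practice the phrase vector comes from training the embedding model with the MWE treated as a single token, so both the phrase and its components have vectors in the same space.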
Two word embedding models, Word2vec and Context2vec, were trained to predict the compositionality of idiomatic expressions. After training, the models' automatic MWE compositionality scores were compared. An intrinsic evaluation was then carried out to assess the performance of Word2vec and Context2vec against an MWE dataset rated by human annotators.
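An intrinsic evaluation of this kind is typically reported as a rank correlation between the model's compositionality scores and the human ratings. A minimal sketch using Spearman's rho on hypothetical scores (the dataset and correlation measure here are illustrative assumptions, not the thesis's actual data):

```python
import numpy as np

def spearman_rho(x, y):
    # Convert each score list to ranks, then take the Pearson correlation
    # of the ranks; this equals Spearman's rho when there are no ties.
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical data: human compositionality ratings for four MWEs
# and the scores a trained embedding model assigned to the same MWEs.
human_scores = [0.10, 0.90, 0.50, 0.30]
model_scores = [0.20, 0.80, 0.60, 0.25]

print(spearman_rho(human_scores, model_scores))  # 1.0: identical ranking
```

A rho close to 1 indicates the model ranks the expressions by compositionality in the same order as the human annotators, even if the absolute scores differ.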
Our findings show that Word2vec outperforms Context2vec on this project's task: Word2vec achieved compositionality scores in line with the human annotators' judgements and required less training time than Context2vec.