Preposition Disambiguation: State of the Art

Efforts to disambiguate prepositions have increased over the last few years, with reported precision reaching 0.80. All such efforts present their results as statistical generalities, identifying the key factors related to the results. Continued progress requires a close examination of the limitations that have been noted. In addition, exploiting these results requires a close examination of the factors associated with each sense, so that the relevant information for each can be encoded in a meaningful way. This post summarizes the current literature on preposition disambiguation as a prelude to further development of the data to be encoded in The Preposition Project (see sidebar link).

Meaningful preposition disambiguation requires a reasonably well-drawn sense inventory and a set of corpus instances that have been disambiguated by hand. The Preposition Project (TPP) provides these. They were used in SemEval 2007 and have since been used in other studies; such studies constitute the basic set of references. Another important study, initiated prior to TPP, is O’Hara and Wiebe (2009).
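
To make these resources concrete, here is a minimal sketch of how such a sense inventory and a set of hand-tagged corpus instances might be represented in Python. The sense labels, glosses, and sentences are purely illustrative and are not drawn from the actual TPP data files.

```python
# Illustrative sketch only: the sense labels, glosses, and sentences below are
# invented for exposition and do not reflect the actual TPP file formats.

# Sense inventory: each preposition maps to a list of (sense_id, gloss) pairs.
sense_inventory = {
    "over": [
        ("over-1", "above and across"),
        ("over-2", "more than a stated amount"),
    ],
}

# Hand-disambiguated corpus instances: a sentence, the position of the target
# preposition, and the sense assigned by the lexicographer.
instances = [
    {"sentence": "The plane flew over the city.", "prep_index": 3, "sense": "over-1"},
    {"sentence": "It cost over a hundred dollars.", "prep_index": 2, "sense": "over-2"},
]
```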

The basic SemEval task of disambiguating prepositions is described in Litkowski & Hargraves (2007). There were three participants in this task, whose results are described in Ye & Baldwin (2007), Yuret (2007), and Popescu, Tonelli, & Pianta (2007). Two later studies have also used the SemEval datasets or TPP data: Tratz & Hovy (2009) and Dahlmeier, Ng, & Schultz (2009).

Each of these studies makes use of statistical techniques (decision trees, maximum entropy, likelihood, and chain-clarifying relationships) to identify significant features associated with preposition disambiguation. The features identified are not identical, in part because the studies analyze different pieces of information. One major difficulty is the lack of any consensus on how noun and verb types should be classified.

Ye & Baldwin used maximum entropy with three types of features: collocational features (open-class words, WordNet synsets, named entities, surrounding words, and surrounding supersenses), syntactic features (parts of speech, chunk tags and types, and parse tree features), and semantic role features (semantic role tags, attached verbs, and verb relative positions). They found that the collocational features played the most significant role. Tratz & Hovy also used maximum entropy, but focused on syntactic structures to identify words of interest: the verb or noun dominating the prepositional phrase, the noun or verb object of the preposition, the subject of the dominating verb, neighboring prepositional phrases, and words within two positions of the target. For each word so identified, they constructed feature sets consisting of the word itself, its lemma, part of speech, synset members, hypernyms, and capitalization. They achieved an 8 percent improvement in disambiguation and concluded that words bearing some syntactic relation to the target preposition were responsible for the improvement.
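
To give a flavor of this feature-based style of classification, here is a minimal sketch in Python, using scikit-learn's LogisticRegression as a stand-in for a maximum entropy learner. The feature extraction is deliberately crude (the governing word and the object of the preposition are approximated by adjacent tokens) and the training data is toy data; it is not a reproduction of the Ye & Baldwin or Tratz & Hovy feature sets.

```python
# Sketch of a feature-based preposition sense classifier. LogisticRegression
# serves as a simple maximum-entropy-style learner; features and data are toy.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(tokens, prep_index):
    """Collocational features around the target preposition."""
    feats = {"prep": tokens[prep_index].lower()}
    if prep_index > 0:
        # Crude approximation of the word governing the prepositional phrase.
        feats["head"] = tokens[prep_index - 1].lower()
    if prep_index + 1 < len(tokens):
        # Crude approximation of the object of the preposition.
        feats["object"] = tokens[prep_index + 1].lower()
    for offset in (-2, -1, 1, 2):  # words within two positions of the target
        i = prep_index + offset
        if 0 <= i < len(tokens):
            feats[f"w{offset}"] = tokens[i].lower()
    return feats

# Toy training data: (tokenized sentence, index of the preposition, sense label).
train = [
    (["The", "plane", "flew", "over", "the", "city"], 3, "over-1"),
    (["It", "cost", "over", "a", "hundred", "dollars"], 2, "over-2"),
]

X = [extract_features(tokens, i) for tokens, i, _ in train]
y = [sense for _, _, sense in train]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test_tokens, test_index = ["She", "jumped", "over", "the", "fence"], 2
print(model.predict([extract_features(test_tokens, test_index)]))
```

The studies above suggest that the richer the collocational and syntactic features (WordNet synsets, hypernyms, parse-tree relations), the better such classifiers perform.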

Yuret examined the context of the target preposition using a statistical language model. His method applies more generally to content words, where he looks at the general task of word-sense disambiguation using possible substitutes as a way of selecting an applicable sense. The method depends on a rich set of substitutes, which prepositions do not have; he makes the point that good-quality substitutes for prepositions are unlikely, since prepositions play a unique role in language. Nevertheless, his results were well above the baseline and support the Ye & Baldwin conclusion that collocational features are important. In addition, his results suggest that the TPP data for “other prepositions” associated with each sense might allow corpus instances for the different prepositions to be studied together.
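
The substitute-based idea can be sketched in a few lines: each candidate sense is associated with possible substitute prepositions (as the TPP “other prepositions” field could supply), each substitute is scored in the context of the target, and the sense whose substitutes fit best wins. In the sketch below, score_in_context and the bigram counts are hypothetical stand-ins for a real statistical language model, and the substitute lists are purely illustrative.

```python
# Sketch of substitute-based sense selection. The bigram counts stand in for a
# real language model; sense labels and substitute lists are illustrative only.

def score_in_context(tokens, index, substitute, lm_counts):
    """Score a substitute by crude bigram counts with its left and right neighbors."""
    left = (tokens[index - 1].lower(), substitute) if index > 0 else None
    right = (substitute, tokens[index + 1].lower()) if index + 1 < len(tokens) else None
    return lm_counts.get(left, 0) + lm_counts.get(right, 0)

def disambiguate(tokens, prep_index, sense_substitutes, lm_counts):
    """Pick the sense whose substitutes score best when swapped into the context."""
    best_sense, best_score = None, float("-inf")
    for sense, substitutes in sense_substitutes.items():
        score = max(score_in_context(tokens, prep_index, s, lm_counts)
                    for s in substitutes)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Illustrative substitutes for two senses of "over" and toy bigram counts.
sense_substitutes = {"over-1": ["above", "across"], "over-2": ["exceeding", "beyond"]}
lm_counts = {("flew", "above"): 12, ("above", "the"): 40, ("cost", "exceeding"): 3}

print(disambiguate(["The", "plane", "flew", "over", "the", "city"], 3,
                   sense_substitutes, lm_counts))  # -> "over-1"
```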

Popescu et al. also examined the context of the preposition, but with the hypothesis that the collocational features take part in a mutual disambiguation process (chain-clarifying relationships). Their performance in SemEval was limited by the fact that the context words for the prepositions were not themselves disambiguated. Their method employs a supervised learning algorithm that assigns training instances (known as Angluin’s algorithm), but was limited to a more superficial identification of the surrounding features.
