Enhanced Word Sketches
Recently, I made a request on the ACL SIGLEX mailing list for tools that might help in analyzing preposition lexical samples. In this request, I indicated a need for software that would specifically provide enhanced word sketch analysis. I only received a couple of replies, one of which asked what I meant by this term. I responded, with some vagueness, but the interchange sparked some thoughts that are worth exploring further. In particular, this discussion raised questions about the amount of information in preposition dictionary entries and what might help in expanding these entries. I’d like to expand on this, particularly on the relation between current approaches to word-sense disambiguation (primarily statistical in nature) and what ends up in the dictionary. I think there is still something of a disconnect between the computational community and the lexicographers.
Adam Kilgarriff developed the word sketch engine, which has found great use among lexicographers. The main novelty of word sketches is the reliance on tagged corpora, in which each word receives a part of speech tag. Of particular note is that these tags constitute the terminals in a parse tree. However, the full parse tree showing constituents is not provided. Instead of a parse tree, some chunking of the terminals is performed, primarily using the Corpus Query Language. The sketches generate very useful information, including many syntactic relationships, through this bottom-up approach. To me, there are two major problems: (1) there is no constituent analysis and (2) there is no semantic analysis. Patrick Hanks has extended word sketches somewhat in what he’s doing with Corpus Pattern Analysis. The focus of this analysis is on the identification of syntagmatic patterns (for verbs only). This pattern analysis has a strong affinity to FrameNet, but crucially adds some semantic characterization to the elements of the patterns using a shallow ontology. The development of these patterns is a labor-intensive effort, involving the development of patterns and the tagging of corpus instances with a pattern number until all instances (a sample if there is a large number) are completed. Currently, there is no automatic or semi-automatic tool to facilitate this process.
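As a rough illustration of the bottom-up approach described above, the sketch below scans part-of-speech-tagged sentences for a preposition and counts the nearest verb or noun to its left and the head noun to its right. It is not the Sketch Engine's grammar and does not use the Corpus Query Language. The toy corpus, the simplified tag set (every preposition tagged IN), and the function names `prep_triples` and `sketch` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
# Tags are simplified for illustration; every preposition is tagged "IN".
TAGGED = [
    [("She", "PRP"), ("moved", "VBD"), ("to", "IN"), ("the", "DT"), ("city", "NN")],
    [("He", "PRP"), ("walked", "VBD"), ("to", "IN"), ("a", "DT"), ("small", "JJ"), ("shop", "NN")],
    [("They", "PRP"), ("moved", "VBD"), ("to", "IN"), ("London", "NNP")],
]


def prep_triples(sentence, prep):
    """Return (left_word, prep, object_head) for each occurrence of prep."""
    triples = []
    for i, (word, tag) in enumerate(sentence):
        if tag != "IN" or word.lower() != prep:
            continue
        # Nearest verb or noun to the left: a crude stand-in for the governor.
        left = next(
            (w.lower() for w, t in reversed(sentence[:i]) if t.startswith(("VB", "NN"))),
            None,
        )
        # Skip determiners and adjectives, then take the last noun of the run.
        j = i + 1
        while j < len(sentence) and sentence[j][1] in ("DT", "JJ"):
            j += 1
        head = None
        while j < len(sentence) and sentence[j][1].startswith("NN"):
            head = sentence[j][0].lower()
            j += 1
        triples.append((left, prep, head))
    return triples


def sketch(corpus, prep):
    """Count left-hand and right-hand collocates of prep across the corpus."""
    left, right = Counter(), Counter()
    for sentence in corpus:
        for gov, _, obj in prep_triples(sentence, prep):
            if gov:
                left[gov] += 1
            if obj:
                right[obj] += 1
    return left, right


left_counts, right_counts = sketch(TAGGED, "to")
```

Because the pattern works only on adjacent tags, it has no way of knowing whether the phrase really attaches to the word on its left, which is the missing constituent analysis, and it says nothing about what kind of thing the object is, which is the missing semantic analysis.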
On the other side of the equation, the computational community has made substantial progress in identifying features useful in word-sense disambiguation (WSD). I’ll stick here to work in preposition WSD, since its literature is smaller but still sufficient to illustrate the process. The major contributions stem from the preposition disambiguation task of SemEval-2007 and a comprehensive treatment by O’Hara & Wiebe (2009). In SemEval-2007, there were three participants: Ye & Baldwin, Yuret, and Popescu et al. Hovy et al. built upon these results to make further advances in preposition disambiguation. While these efforts have identified and refined the set of features useful in preposition disambiguation, reaching up to 85 percent accuracy, I don’t think they have fully explored the potential feature space. To consider how enhanced word sketches might contribute further, we need to examine in detail what these features are and how they correspond to what’s in a dictionary.
Collocation features (the context) are the most important in preposition disambiguation. While the earlier studies focused on context windows, Hovy et al. found that the governor (the word to which a prepositional phrase is attached) and the object are of key importance, both of which are generally found within context windows, with the governor of greater importance than the object. Dictionaries do not generally provide any information about (the class of) the governor. One exception to this may be found in definitions of the preposition of (see the online preposition project data), where several senses characterize the governor. Conversely, the definitions of many verbs and nouns will identify, explicitly or implicitly, an association with a specific preposition. For example, move (go in a specified direction) links well with a sense of to (expressing motion in the direction of (a particular location)); this is the kind of linkage (chain-clarifying relationship) investigated in Popescu et al. More preposition definitions characterize the preposition object, although some specify the semantic role of the object rather than properties of the object itself.
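To make the contrast between window features and attachment features concrete, here is a minimal sketch of how one instance might be encoded. It is not the feature set of any of the studies cited. The function names, the feature keys, and the example sentence are hypothetical, and the governor and object are supplied by hand where a real system would take them from a parser.

```python
def window_features(tokens, prep_index, size=2):
    """Words within `size` positions of the preposition, keyed by offset."""
    feats = {}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        k = prep_index + offset
        if 0 <= k < len(tokens):
            feats[f"w[{offset:+d}]"] = tokens[k].lower()
    return feats


def attachment_features(governor, obj):
    """Governor and object of the prepositional phrase (given, not computed)."""
    return {"gov": governor.lower(), "obj": obj.lower()}


# Hypothetical instance of "to"; governor and object are annotated by hand.
tokens = ["She", "moved", "quickly", "to", "the", "new", "city"]
win = window_features(tokens, prep_index=3, size=2)
att = attachment_features(governor="moved", obj="city")
```

In this instance a window of two words on each side captures the governor but misses the object, while the attachment features name both directly, however far they sit from the preposition.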
In these studies, syntactic and semantic features were found to be less important. However, this conclusion depends on how well the available tools characterize these features. The semantic characterizations in these studies examined only WordNet-based features. Since WordNet makes no claims about semantic classes, this conclusion must be held in abeyance.
The Preposition Project (TPP) has characterized many properties of each preposition sense. These have not been fully investigated in the studies above. TPP labels each sense according to its Quirk syntax; Hovy et al. used “fronting” (capitalization) as a feature; such a feature could be important for some prepositions but not for others. TPP identifies FrameNet frames and frame elements associated with each sense (based on the available corpus); these constitute an additional type of semantic characterization and are possibly relevant features that could be investigated. TPP also identifies other prepositions that can substitute; these were used by Yuret, who found that substitutions, while useful for disambiguation, did not work as well as for verbs and nouns. Potentially, these substitutions could be examined in conjunction with the preposition classes built from the TPP data; disambiguation properties by class have not yet been investigated. To some extent, coarse classes have been investigated through use of Penn Treebank (PTB) data; however, since prepositions were not accorded much prominence in PTB, further study may be warranted along these lines. O’Hara & Wiebe and Hovy et al. both reported good results
In summary, what I’d like an enhanced word sketch to do is to enable me to tag (like a lexicographer), build syntagmatic patterns for each sense, and ultimately assign (I think) definitive frame elements to each sense. I’d want to put this into a dictionary in such a way that we could have something like a decision tree to identify the appropriate sense. I’m hoping that such a decision tree would greatly facilitate building an appropriate representation in which the PPs are placed in their proper subsidiary roles.
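To give a sense of what such a decision tree might look like, here is a minimal hand-built sketch for the preposition "to". Everything in it is an assumption for illustration: the sense labels, the semantic classes, and the small lexicons are hypothetical and are not taken from TPP or any dictionary. The one grounded element is the pairing of a motion governor such as "move" with the sense of "to" expressing motion toward a location.

```python
# Hypothetical semantic classes; a real entry would draw on a lexicon or ontology.
GOVERNOR_CLASS = {"move": "motion", "walk": "motion", "give": "transfer", "send": "transfer"}
OBJECT_CLASS = {"city": "location", "shop": "location", "friend": "person", "teacher": "person"}

UNRESOLVED = ("leaf", "to:unresolved")

# A node is ("leaf", sense_label) or ("test", feature_name, branches, default).
# Sense labels are illustrative, not actual dictionary sense numbers.
TREE = (
    "test",
    "gov_class",
    {
        "motion": ("test", "obj_class", {"location": ("leaf", "to:direction")}, UNRESOLVED),
        "transfer": ("test", "obj_class", {"person": ("leaf", "to:recipient")}, UNRESOLVED),
    },
    UNRESOLVED,
)


def features(gov_lemma, obj_lemma):
    """Map governor and object lemmas to their (hypothetical) semantic classes."""
    return {
        "gov_class": GOVERNOR_CLASS.get(gov_lemma),
        "obj_class": OBJECT_CLASS.get(obj_lemma),
    }


def classify(tree, feats):
    """Walk the tree, following the branch that matches each tested feature."""
    while tree[0] == "test":
        _, name, branches, default = tree
        tree = branches.get(feats.get(name), default)
    return tree[1]


sense = classify(TREE, features("move", "city"))
```

Each test in the tree corresponds to something a dictionary entry could state about a sense: the class of the governor first, then the class of the object. Instances that fall through to the unresolved leaf would be the ones a lexicographer still needs to tag.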