Establishing the equivalence of text segments is a task that can be accomplished at many levels. Prepositions are an important vehicle for indicating semantic roles, which can be used to identify where text should be examined for equivalence. The Preposition Project is disambiguating prepositions in the FrameNet corpus using a sense inventory from a current dictionary and guided by a comprehensive treatment of preposition meaning. The project is also specifically generating a set of data that describe semantic role alternation patterns, independent of prepositions. The nature of this data is described and procedures for using it in question answering and summarization are outlined. We describe how this data may provide an additional tool for establishing semantic and paraphrase equivalence.

Establishing the semantic equivalence of two text segments is important in many NLP applications. In factual question answering, it is necessary to recognize not only the equivalence of multiple phrases that might answer the question, but also that the surrounding text is sufficiently similar to justify assertions that the question is being answered. In summarization, it is necessary to examine many sentences within a text to identify distinctly different concepts phrased in different ways, so that a summary does not present redundant information. In general, the task of establishing equivalence or creating paraphrases involves two levels of analysis, one at the phrase level (usually noun phrases) and the other at a clausal or sentential level (different orderings of the phrases).

At the phrase level, the techniques for determining equivalence usually involve synonym detection or establishing referential equivalence (via coreferences, anaphora, and definite noun phrases). At the clausal or sentential level, however, the methods are usually much more subjective and not as amenable to computational methods.

Verb diathesis alternations (e.g., Levin (1993)) provide some basis for examining a variety of potential equivalences based on arguments, but their coverage is somewhat limited, since they rely on elements of meaning that are quite closely related to the meaning of the verbs. Frame semantics offers a more general solution, but has not been studied quite as thoroughly. A new source of data on preposition behavior based on the FrameNet datasets may prove useful for examining alternation patterns in frame elements or semantic roles. This data is being made available in The Preposition Project.

In the following sections, we (1) describe The Preposition Project and the data being generated, (2) examine the alternation data that underlies the current study, (3) lay out procedures for using this data in NLP applications, and (4) compare this approach with selected research on paraphrasing and semantic knowledge representation.

The Preposition Project (TPP) is designed to provide a comprehensive database of preposition senses suitable for use in natural language processing. It is attempting to fine-tune the distinctions within and among prepositions in a native speaker dictionary (the Oxford Dictionary of English, 2003) by comparing and contrasting them with the treatment of prepositions in two other sources: the instances of prepositions that are functionally tagged in FrameNet, and the treatment of prepositions in a traditional English grammar (Quirk et al., 1985).

Each of 847 preposition senses for 373 prepositions (including phrasal prepositions) will be characterized with a semantic role name and the syntactic and semantic properties of its complement and attachment point. Each sense will be further described by (1) a link to its definition in the Oxford Dictionary of English, (2) its basic syntactic function and meaning as described in Quirk et al. (1985), (3) other prepositions filling a similar semantic role, (4) FrameNet frames and frame elements, and (5) other syntactic forms in which the semantic role may be realized.

The primary source of data is the set of corpus instances drawn from the FrameNet database of sentences tagged with semantic roles (frame elements) for each preposition. Since the FrameNet database was not constructed with prepositions in mind, the identification of frame elements using a preposition provides an independent, unbiased corpus that considerably facilitates the construction of a high-quality preposition database.

The Oxford Dictionary of English (ODE, 2003) (and its predecessor, the New Oxford Dictionary of English (NODE, 1998)) was chosen as the source of the preposition sense inventory because of the clarity and organization of its senses and its reliance on corpus evidence in its construction. Litkowski (2002) describes how prepositions in NODE were identified, particularly procedures used for identifying phrasal prepositions that are not accorded headword status and appear, unlabeled as prepositions, under other headwords. As indicated there, 373 prepositions (listed in the appendix to the paper) and 847 preposition senses were identified. These form the basis for TPP's sense inventory.

The initial focus of TPP is on the most common and most polysemous prepositions. After selecting a preposition for study (by, through, with, for, and of have been completed at the time of submission), the FrameNet corpus instances of the preposition are obtained using CL Research's publicly available FrameNet Explorer (FNE).

The FrameNet database includes approximately 7,500 XML lexical unit files, each of which contains tagged sentences for a specific lexical item and frame (e.g., the item move.v in the Motion frame). Tagged sentences are grouped into subcorpora, each of which has a name. The name encodes salient syntactic properties of the subcorpus, e.g., V-730-s20-ppacross, which includes sentences using the verb move that include a prepositional phrase beginning with across (which are tagged as instances of the Path frame element within the Motion frame).
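As a minimal sketch of how the preposition can be read off a subcorpus name, the following Python fragment assumes that the name ends in a segment consisting of pp followed by the preposition, as in the example above. The function name and the regular expression are illustrative assumptions; they are not part of FrameNet or FNE.

```python
import re
from typing import Optional

# Assumption: the preposition is encoded as a final "-pp<prep>" segment,
# as in the example name V-730-s20-ppacross.
_PP_SEGMENT = re.compile(r"-pp([a-z_]+)$")


def preposition_in_subcorpus(name: str) -> Optional[str]:
    """Return the preposition encoded in a subcorpus name, or None if absent."""
    match = _PP_SEGMENT.search(name)
    return match.group(1) if match else None


print(preposition_in_subcorpus("V-730-s20-ppacross"))  # across
print(preposition_in_subcorpus("V-730-s20"))           # None
```

Names that encode other syntactic properties would need additional patterns, which are not attempted here.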

FNE is used initially to generate a text file serving as an index of all tagged FrameNet instances of a given preposition. FNE searches each of the lexical unit files to find subcorpora that have ppprep in the name, where prep is the target preposition. For each subcorpus having the target name (e.g., ppby), a line is written to the text file, containing five elements: the frame name, the frame element, the lexical unit, the subcorpus name, and the sentence ID. The text file is then imported into an Excel spreadsheet (800 to 4800 instances for the prepositions mentioned above).
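The index-building step can be sketched in Python as follows. This is not FNE's implementation: it assumes the tagged instances have already been read into memory, and the Instance type, the write_index function, and the sentence IDs are hypothetical names and values introduced only for illustration.

```python
import csv
import io
from typing import Iterable, NamedTuple, TextIO


class Instance(NamedTuple):
    """The five elements written to each index line."""
    frame: str
    frame_element: str
    lexical_unit: str
    subcorpus: str
    sentence_id: str


def write_index(instances: Iterable[Instance], prep: str, out: TextIO) -> int:
    """Write one tab-separated line per instance in a pp<prep> subcorpus."""
    target = "pp" + prep
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for inst in instances:
        # Compare whole name segments so that "ppin" does not match "ppinto".
        if target in inst.subcorpus.split("-"):
            writer.writerow(inst)
            written += 1
    return written


# Made-up sample records; the sentence IDs are placeholders.
SAMPLE = [
    Instance("Arriving", "Path", "enter.v", "V-730-s20-ppby", "S-0001"),
    Instance("Arriving", "Path", "enter.v", "V-730-s20-ppby", "S-0002"),
    Instance("Motion", "Path", "move.v", "V-730-s20-ppacross", "S-0003"),
]

buffer = io.StringIO()
count = write_index(SAMPLE, "by", buffer)
print(count)
print(buffer.getvalue(), end="")
```

A tab-separated file of this shape can be opened directly in a spreadsheet program.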

Using the instance file as a guide, a lexicographer begins analyzing the preposition’s senses. A separate Excel spreadsheet is devised for the preposition, with one row for each sense. The lexicographer examines the definitions for the preposition, available information about the preposition in Quirk et al., and the FrameNet corpus instances. On the basis of this information, the lexicographer assigns a semantic role name, intended to be a characterization of the sort of information that the given preposition introduces. Based on an interpretation of the definition and on the corpus instances, the lexicographer then characterizes the syntactic and semantic properties of the sense's complement and attachment point.

Next, in the instance spreadsheet, the lexicographer assigns a sense number to each sentence instance. The lexicographer uses FNE for this purpose, using the information provided above for each corpus instance. FNE has the facility to display all annotated instances of a lexical unit (such as arrest.v) entered on its search screen. In addition, all subcorpus names are displayed in a drop-down list; by selecting the relevant subcorpus (e.g., V-730-s20-ppby), the lexicographer can view just those sentences. The lexicographer can then determine which ODE sense of the preposition is applicable. Since similar items may be grouped together (i.e., frame name, frame element name, and lexical unit), several instances can be tagged at a time.

Frame	Frame Element	Lexical Unit	GF	PT	Preposition
Arriving	Mode_of_transportation	arrive.v	Comp	PP	by
Arriving	Mode_of_transportation	arrive.v	Comp	PP	in
Arriving	Mode_of_transportation	come.v	Comp	PP	by
Arriving	Mode_of_transportation	return.n	Comp	PP	by
Arriving	Path	approach.v	Comp	PP	on
Arriving	Path	approach.v	Comp	PP	through
Arriving	Path	approach.v	Comp	PP	via
Arriving	Path	arrive.v	Comp	PP	through
Arriving	Path	arrive.v	Comp	PP	via
Arriving	Path	come.v	Comp	PP	round
Arriving	Path	come.v	Comp	PP	through
Arriving	Path	come.v	Comp	PP	via
Arriving	Path	come.v	Obj	NP
Arriving	Path	enter.v	Comp	PP	at
Arriving	Path	enter.v	Comp	PP	by
Arriving	Path	enter.v	Comp	PP	through
Arriving	Path	enter.v	Comp	PP	via
Arriving	Path	get.v	Comp	PP	past
Arriving	Path	reach.v	Comp	PP	by
Arriving	Path	reach.v	Comp	PP	through
Arriving	Path	reach.v	Comp	PPing
Arriving	Path	return.n	Comp	PP	towards
Arriving	Path	return.v	Comp	PP	across

Table 1. Variations in Syntactic Realizations of a Frame Element for ‘by’

The tagged instance spreadsheet provides the basis for generating several other files.

With the tagged instances, a simple sort by sense number of the Excel spreadsheet identifies the (Frame Frame_Element) pairs for each sense. These are aggregated by sense using a Perl script. (See Litkowski & Hargraves (2005) for a description of how these aggregations can be used to analyze the semantic role for a sense.)
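The aggregation step might look like the following sketch, written in Python rather than the Perl actually used by the project. The function name and the row layout (sense number, frame, frame element) are assumptions, and the repeated row is included only to illustrate counting.

```python
from collections import Counter, defaultdict


def aggregate_by_sense(rows):
    """Map each sense number to a count of its (frame, frame element) pairs."""
    by_sense = defaultdict(Counter)
    for sense, frame, frame_element in rows:
        by_sense[sense][(frame, frame_element)] += 1
    # Return plain dictionaries, ordered by sense number.
    return {sense: dict(pairs) for sense, pairs in sorted(by_sense.items())}


# Illustrative rows for two senses of "by"; the row counts are made up.
ROWS = [
    (8, "Arriving", "Mode_of_transportation"),
    (5, "Arriving", "Path"),
    (8, "Arriving", "Mode_of_transportation"),
]

for sense, pairs in aggregate_by_sense(ROWS).items():
    print(sense, pairs)
```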

A tagged sentence in the FrameNet database identifies a specific frame element within a specific frame for the prepositional phrase introduced by the preposition. The frame element and frame can be used as a seed to find other ways recorded in FrameNet for realizing the combination. For example, as shown in Table 1, by introduces the frame element Mode_of_transportation or Path in the Arriving frame. The FrameNet database can be queried to determine other prepositions and other syntactic realizations in which these frame elements occur.

The distinct patterns in which these occur are summarized by identifying all unique occurrences of (Frame Frame_Element Lexical_Unit Grammatical_Function Phrase_Type Preposition) within the database. Preposition is included only when the Phrase_Type is PP. There may be many sentences that have been tagged similarly, but only unique occurrences need to be identified to examine the distribution of the same frame element.
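The reduction to unique occurrences can be illustrated with a short Python sketch. The tuple layout follows the six fields named above; the function name is an assumption, and an empty string is used here to mark the absence of a preposition.

```python
def unique_patterns(annotations):
    """Reduce tagged annotations to a sorted list of distinct patterns.

    Each annotation is a tuple (frame, frame_element, lexical_unit,
    grammatical_function, phrase_type, preposition). The preposition is
    kept only when the phrase type is PP.
    """
    patterns = set()
    for frame, fe, lu, gf, pt, prep in annotations:
        patterns.add((frame, fe, lu, gf, pt, prep if pt == "PP" else ""))
    return sorted(patterns)


ANNOTATIONS = [
    ("Arriving", "Path", "come.v", "Comp", "PP", "through"),
    ("Arriving", "Path", "come.v", "Comp", "PP", "through"),  # tagged the same way
    ("Arriving", "Path", "come.v", "Obj", "NP", None),
    ("Arriving", "Path", "enter.v", "Comp", "PP", "by"),
]

for pattern in unique_patterns(ANNOTATIONS):
    print("\t".join(pattern))
```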

In Table 1, several combinations are evoked by the seed element. The Mode_of_transportation frame element was seeded by the instances for arrive.v and/or come.v (sense 8 of by); the Path element was evoked by the instances for enter.v (sense 5 of by). It can be seen that in addition to by, in is also used to indicate the Mode_of_transportation frame element, also as a Complement to the main verb. For the Path frame element, in addition to by, the prepositions on, through, via, round, at, past, towards, and across are used. The Path frame element is also expressed as the Direct Object for one verb, come.

Data generated as in Table 1 are first used to identify other prepositions labeled by the FrameNet lexicographers as realizing the same (Frame Frame_Element) combination. For by, 8548 lines like those in Table 1 were generated, of which 3872 had a PP Phrase Type. Similar numbers of lines were generated for the other prepositions. A Perl script is used to aggregate these other prepositions for each sense. (See Litkowski & Hargraves (2005) for further discussion of these synonymic prepositions.)
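A Python sketch of this second aggregation is given below, in place of the project's Perl script. It assumes each sense has been associated with a single seed (frame, frame element) pair; the function name and data layout are illustrative, and the lines are a subset of Table 1.

```python
from collections import defaultdict


def other_prepositions(seeds, lines, target):
    """For each sense, list the prepositions other than the target that
    realize the same (frame, frame element) pair in a PP."""
    by_pair = defaultdict(set)
    for frame, fe, _lu, _gf, pt, prep in lines:
        if pt == "PP" and prep != target:
            by_pair[(frame, fe)].add(prep)
    return {sense: sorted(by_pair.get(pair, set())) for sense, pair in seeds.items()}


# Seed pairs for two senses of "by", as described for Table 1.
SEEDS = {
    5: ("Arriving", "Path"),
    8: ("Arriving", "Mode_of_transportation"),
}

# A subset of the lines in Table 1.
LINES = [
    ("Arriving", "Mode_of_transportation", "arrive.v", "Comp", "PP", "by"),
    ("Arriving", "Mode_of_transportation", "arrive.v", "Comp", "PP", "in"),
    ("Arriving", "Path", "approach.v", "Comp", "PP", "on"),
    ("Arriving", "Path", "approach.v", "Comp", "PP", "via"),
    ("Arriving", "Path", "come.v", "Obj", "NP", ""),
    ("Arriving", "Path", "enter.v", "Comp", "PP", "by"),
    ("Arriving", "Path", "enter.v", "Comp", "PP", "through"),
    ("Arriving", "Path", "get.v", "Comp", "PP", "past"),
    ("Arriving", "Path", "return.v", "Comp", "PP", "across"),
]

print(other_prepositions(SEEDS, LINES, "by"))
```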

In general, realizations of the same frame element in prepositional phrases occur in only about 35 or 40 percent of lines generated as in Table 1. The other lines, such as the one for come in Table 1, form the basis for examining other syntactic realizations of the same semantic role.
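Selecting the non-prepositional lines is a simple filter, sketched here in Python. The function name is an assumption, and the lines are a small subset of Table 1 chosen to include its two non-PP rows.

```python
from collections import defaultdict


def non_pp_realizations(lines):
    """Group the non-PP realizations by (frame, frame element).

    Returns a mapping from (frame, frame_element) to a sorted list of
    (lexical_unit, grammatical_function, phrase_type) triples.
    """
    grouped = defaultdict(set)
    for frame, fe, lu, gf, pt, _prep in lines:
        if pt != "PP":
            grouped[(frame, fe)].add((lu, gf, pt))
    return {pair: sorted(items) for pair, items in grouped.items()}


# A subset of the lines in Table 1.
LINES = [
    ("Arriving", "Path", "come.v", "Comp", "PP", "via"),
    ("Arriving", "Path", "come.v", "Obj", "NP", ""),
    ("Arriving", "Path", "reach.v", "Comp", "PP", "by"),
    ("Arriving", "Path", "reach.v", "Comp", "PPing", ""),
]

print(non_pp_realizations(LINES))
```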

In a second example (not shown), 52 lines were generated for the Cure:Treatment combination from a single instance of through, via the verb rehabilitate.v (sense 12, labeled Intermediary by the lexicographer, but essentially a means semantic role). The Cure:Treatment pair is associated with a much greater range of lexical items, including not only verbs (alleviate, cure, ease, heal, rehabilitate, resuscitate, and treat), but also nouns (cure, healer, palliation, remedy, therapist, therapy, and treatment) and adjectives (curative, palliative, rehabilitative, and therapeutic). Only 16 of these lines had a PP Phrase Type.

The alternation patterns for expressing the Treatment frame element appear to vary by the part of speech of the lexical item. Each of the lexical items identified above is the target word in a frame, and the frame element appears in a particular phrase type fulfilling some grammatical function in a sentence that has been tagged by the FrameNet lexicographers.

For the verbs, the Treatment frame element appears as an external argument, i.e., the subject of the verb (“Ext NP”), a complement prepositional phrase containing a gerund (“Comp PPing”), a complement adverbial phrase, e.g., treated pharmacologically (“Comp AVP”), or a definite null instantiation, indicating that the element is an anaphor (“DNI”). For the nouns, it appears as the subject of a copular verb (“Ext NP”), a complement of the target word (“Comp NP”), a modifying noun (“Mod N”), a modifying adjective (“Mod AJP”), a gerundial phrase linked by a copula (“Comp VPing”), an adjective linked by a copula (“Comp AJP”), and an indefinite null instantiation, indicating that there is no referent (“INI”). For adjectives, it appears as the subject of a copular verb (“Ext NP”), a complement to the adjective (“Comp NP”), and the head noun or noun phrase (“Head N” and “Head NP”).
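The patterns just listed can be held in a small lookup structure, sketched below in Python. The pattern labels are those given above for the Treatment frame element; the variable and function names are assumptions made for illustration.

```python
# Realizations of the Treatment frame element, by part of speech of the
# target lexical item, as listed in the text.
PATTERNS_BY_POS = {
    "verb": {"Ext NP", "Comp PPing", "Comp AVP", "DNI"},
    "noun": {"Ext NP", "Comp NP", "Mod N", "Mod AJP",
             "Comp VPing", "Comp AJP", "INI"},
    "adjective": {"Ext NP", "Comp NP", "Head N", "Head NP"},
}


def parts_of_speech_for(pattern):
    """Return the parts of speech for which a pattern is attested."""
    return sorted(pos for pos, pats in PATTERNS_BY_POS.items() if pattern in pats)


def shared_by_all():
    """Return the patterns attested for every part of speech."""
    return set.intersection(*PATTERNS_BY_POS.values())


print(shared_by_all())
print(parts_of_speech_for("Comp NP"))
```

In this listing, only the external-argument pattern is common to all three parts of speech.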

In the FrameNet corpus, there is not an instance of each alternation pattern for every lexical unit. Thus, “Comp AVP” appears only for the verb treat. However, it seems that each alternation pattern is valid for each of the lexical items (i.e., one can easily construct an uncontrived example).

Although the Treatment frame element is identified as a core frame element in the Cure frame, it does not appear to be an essential component of the meaning of any of the lexical units. Thus, examining the definitions of the lexical units shows that there is no requirement for a collocative identification of a Treatment. This is particularly true of the verbs, where one might expect a lexical preference for at least one sense to specify a possible subject that is a treatment; the grammar codes in the definitions, by contrast, do provide some diathesis alternations. This absence suggests that frames may usefully be associated with individual senses (in at least an electronic version of a dictionary).

Associating frames with definitions raises several questions, which can only be briefly mentioned in this paper. As suggested, the alternation patterns seem applicable to several lexical items; a dictionary might capture this commonality by providing some inheritance mechanism. Within the FrameNet corpus, many lexical items have more than one associated frame, i.e., the lexical items are polysemous. In the Senseval-3 semantic roles task (Litkowski, 2004), participants were able to make role assignments at a very high level compared with earlier work (Gildea & Jurafsky, 2002), but in this task, they were provided with the frame name and did not have to disambiguate to identify the relevant frame. However, it is likely that disambiguation would not be a significant problem, particularly since simple knowledge about associated frame elements (i.e., slots) and their possible syntactic fillers would provide significant additional information that could be used in disambiguation.

Another important issue is the coverage of FrameNet. Currently, about 6000 distinct lexical units are present in the 7500 lexical unit files. No studies of coverage have been performed to identify whether other lexical items could inherit from the existing set of about 700 frames or how many additional frames would be necessary. Additionally, the interaction of several frames within a sentence has not been investigated. Typically, only a few phrases in a FrameNet sentence are tagged, and these only based on a specific frame.

Assuming the availability of semantic role alternation patterns for lookup, disambiguation, and representation, this information would provide many possibilities for exploitation in NLP applications such as question answering and text summarization. In all likelihood, this would have to be integrated with a full parsing of texts, although it might be possible to achieve results with shallow parsing. For question answering, where answers might first be sought by transforming a question into a canonical form with an empty slot, the alternation patterns would provide a principled way of looking elsewhere in a sentence. For summarization, labeling of text with semantic roles makes it somewhat easier to apply techniques for paraphrase equivalence and for detection of redundant information. This would make it easier to perform multiple document summarization that either extracts individual sentences that go into a summary or generates a summary from building blocks. In both applications, paraphrasing or detection of semantic equivalence is essential.
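The question answering use can be sketched as a lookup over parsed constituents. The Python fragment below is an illustration under stated assumptions, not a description of an existing system: the realizations are taken from Table 1, while the function name, the constituent format, and the example sentence fragments are hypothetical stand-ins for the output of a parser.

```python
# Known realizations of Arriving:Path, drawn from Table 1, as
# (grammatical_function, phrase_type, preposition) triples.
KNOWN_REALIZATIONS = {
    ("Arriving", "Path"): {
        ("Comp", "PP", "by"),
        ("Comp", "PP", "through"),
        ("Comp", "PP", "via"),
        ("Obj", "NP", ""),
    },
}


def candidate_fillers(frame, frame_element, constituents,
                      realizations=KNOWN_REALIZATIONS):
    """Return the text of constituents that match a known realization.

    Each constituent is a tuple (grammatical_function, phrase_type,
    preposition, text); this format is a hypothetical parser output.
    """
    allowed = realizations.get((frame, frame_element), set())
    return [text for gf, pt, prep, text in constituents
            if (gf, pt, prep) in allowed]


# Hypothetical constituents for a sentence with the target verb enter.
CONSTITUENTS = [
    ("Ext", "NP", "", "the visitors"),
    ("Comp", "PP", "through", "through the side gate"),
]

print(candidate_fillers("Arriving", "Path", CONSTITUENTS))
```

A fuller treatment would also need to identify the frame evoked by the target word, which is assumed to be given here.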

Quirk et al. (2004) present several techniques for paraphrase generation based in large part on the manual alignment of sentences thought to be likely paraphrase sentence pairs from news articles available on the Web. As indicated in Dolan et al. (2004), paraphrase alternations consisted of elaborations, phrasal differences, spelling, synonymy, anaphora, and reordering. The alignment guide provided to the taggers recognized the different types of alternations and provided rules for assessing equivalence.

The authors acknowledge, however, that their methods (primarily relying on edit distance) did not identify “interesting” sentence pairs that were similar in meaning. They suggest that further exploration of linguistic features and discourse structure (among other strategies) may yield more sentence pairs, as well as provide the basis for better automated metrics for paraphrase evaluation. Our initial results suggest that the semantic role alternation patterns from The Preposition Project may provide a rich set of data that can be used in pursuing these objectives.

Fiszman et al. (2003) describe a system for extracting propositions (particularly treatment propositions) from biomedical texts. For example, from the sentence, “Alfuzosin is effective in the treatment of benign prostatic hyperplasia,” the proposition “Alfuzosin-TREATS-Prostatic Hypertrophy” is extracted. While this system depends crucially on a semantic network that allows a certain amount of reasoning, the component that processes the sentence involves the use of semantic alternation patterns. In particular, the sentence processor makes use of indicator rules that involve specification of semantic arguments for the word treatment. These rules have been developed manually and typically include several patterns associated with individual words. As indicated above in the discussion of the Cure:Treatment frame element, the types of alternation patterns may provide a capability for automatically generating indicator rules.

Although only five prepositions have been analyzed in detail thus far in The Preposition Project, their roughly 80 definitions provide a substantial resource that extends to many other prepositions via synonymy and inheritance. In addition, FrameNet Explorer can be used independently of the project to generate semantic alternation patterns for all of the FrameNet frames. These patterns can be aggregated by lexical units and verb sets (perhaps close to verb classes) and can be integrated into NLP applications such as question answering and text summarization.

While the initial focus has been on preposition behavior, the semantic role alternations suggest the value of the FrameNet data for paraphrase opportunities. In addition, the utility of the alternation patterns can be evaluated systematically by determining their value in benchmark test sets developed for question answering, text summarization, and paraphrase assessments.

Dolan, W., Quirk, C., & Brockett, C. (2004). Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. Proceedings of COLING-2004.

Fiszman, M., Rindflesch, T., & Kilicoglu, H. (2003). Integrating a Hypernymic Proposition Interpreter into a Semantic Processor for Biomedical Texts. Proceedings of the AMIA Annual Symposium on Medical Informatics.

Gildea, D., & Jurafsky, D. (2002). Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3), 245-288.

Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press.

Litkowski, K. C. (2002). Digraph Analysis of Dictionary Preposition Definitions. Word Sense Disambiguation: Recent Success and Future Directions. Philadelphia, PA: Association for Computational Linguistics.

Litkowski, K. C. (2004). Senseval-3 Task: Automatic Labeling of Semantic Roles. Proceedings of Senseval-3. Association for Computational Linguistics.

Litkowski, K. C., & Hargraves, O. (2005). The Preposition Project. Second ACL-SIGSEM Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications. Colchester, England: University of Essex.

The New Oxford Dictionary of English. (1998) (J. Pearsall, Ed.). Oxford: Clarendon Press.

The Oxford Dictionary of English. (2003) (A. Stevenson and C. Soanes, Eds.). Oxford: Clarendon Press.

Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. London: Longman.

Quirk, C., Brockett, C., & Dolan, W. (2004). Monolingual Machine Translation for Paraphrase Generation. Proceedings of EMNLP-2004.