Building on data from The Preposition Project (TPP), the Pattern Dictionary of English Prepositions (PDEP) is intended to identify the prototypical syntagmatic patterns with which prepositions in use are associated. By definition, PDEP seeks to identify linguistic units used sequentially to make well-formed structures and to characterize the relationship between these units. In the case of prepositions, the units are the complement (object) of the preposition and the governor (point of attachment) of the prepositional phrase. The relationship is usually called the semantic role, specifying the relationship that the prepositional phrase has with the main verb in a clause. This term is extended to include cases where the prepositional phrase modifies nouns or adjectives.
Standard dictionaries include definitions of prepositions, but they only loosely characterize the syntagmatic patterns associated with each sense. PDEP takes this a step further, looking for prototypical sentence contexts to characterize the patterns. PDEP is modeled on the principles of Corpus Pattern Analysis (CPA), developed to characterize syntagmatic patterns for verbs, which are viewed as central to expression of meaning. These principles are described more fully in Patrick Hanks (2013), Lexical Analysis: Norms and Exploitations. Currently, CPA is being used in the project Disambiguation of Verbs by Collocation (DVC) to develop a Pattern Dictionary of English Verbs (PDEV).
PDEP is closely related to PDEV. As indicated, most syntagmatic patterns for prepositions are related to the main verb in a clause. Because of this close relation, PDEP is viewed as subordinate to PDEV. This relationship is so close that the implementation of PDEP employs significant portions of the code being used in PDEV, with appropriate modifications as necessary to capture the syntagmatic patterns for preposition behavior.
The Pattern Dictionary of English Prepositions is an online dictionary consisting of three main components: (1) a complete inventory of English single-word and phrasal prepositions, (2) a summary list of patterns for each preposition, with details for each pattern, and (3) actual corpus instances for each preposition, many of them sense-tagged and many available for analysis of the prototypical sentence contexts. The details of each component are described below to provide a user's guide for navigating and exploiting the PDEP data. PDEP is further described in a paper presented at ACL 2014; see the reference below for the full citation.
The start page for PDEP asks for the prepositions you want to see. You can enter a single preposition, the beginning letters of prepositions, or a regular expression (usually prefixed with '^' to indicate the beginning letters). Or, you can select an editing status: 'All' to retrieve all prepositions, or another status to look at just those prepositions at some point in the process of being analyzed. (Currently, the only active status is 'initial'. Other statuses, to be used in the future, are 'complete' (when all editing is done), 'ready' (indicating that everything has been done, but awaiting final review), 'WIP' (work in progress), and 'VLF' (very low frequency prepositions, for which there is likely not enough evidence for a definitive treatment).)
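The effect of the '^' prefix can be illustrated with a minimal sketch. It assumes the query is treated as a regular expression searched against each entry in the inventory; the sample inventory and the function name select_prepositions are hypothetical, and the actual PDEP implementation may match differently.

```python
import re

# Hypothetical sample of entries; the real inventory lives in the PDEP database.
PREPOSITIONS = ["about", "above", "across", "after", "in", "in front of", "into", "over"]


def select_prepositions(query, inventory):
    """Return the entries in which the regular expression `query` is found."""
    pattern = re.compile(query)
    return [prep for prep in inventory if pattern.search(prep)]


# '^' anchors the match at the beginning of the entry.
print(select_prepositions("^ab", PREPOSITIONS))
# Adding '$' restricts the result to the single preposition 'in'.
print(select_prepositions("^in$", PREPOSITIONS))
```

Under this assumption, an unanchored query such as 'in' would also retrieve entries that merely contain those letters, which is why the anchored form is usually preferable.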
The opening page of PDEP consists of a text box where you may enter a specific preposition, a drop-down list of "status" options (indicating the status of work on a preposition, with a default value of 'All' to view all prepositions), and a button Load to load either a specific preposition or a set. When the Load button is pushed, a list of the selected prepositions is shown.
The table of prepositions shows the status of the investigation into the properties for each preposition. The initial status indicates data that has been developed under The Preposition Project. The next column identifies the number of patterns associated with each preposition (this may also be viewed as the number of senses). The next two columns identify the number of sentences (instances) that have been sense-tagged for each preposition from the FrameNet project or the Oxford English Corpus. The next three columns refer to sentences that have been gathered under TPP for analysis. The column labeled BNC Freq identifies the number of instances present in the written portion of the British National Corpus; this column thus describes the relative frequency with which each preposition occurs. The columns TPP Tagged and TPP Insts indicate the sample size that has been drawn from the BNC for this analysis. The number tagged indicates how many of the sample have been sense-tagged. The remaining columns of the table describe the editing that has occurred for the preposition.
The table of prepositions is sortable in each column, by clicking on the table headings. When you click on any row, a new tab is opened showing the patterns for the preposition.
The overall progress in tagging corpus instances is also shown at the bottom of this table. This line identifies the number of prepositions, the number of patterns, the number of FrameNet instances, the number of Oxford English Corpus instances, the number of TPP instances that have been tagged, the total number of TPP instances, and the estimated total frequency of prepositions in the written portion of the British National Corpus.
When you open a preposition, a new tab is opened with a title consisting of the preposition and its editorial status. This tab shows the current set of patterns for the preposition. Along the top, the number of tagged instances from each of the TPP corpora is identified, along with the number of untagged TPP instances. The initial display shows a summary for each pattern, giving the pattern of the preposition in use, with a template for the general case, consisting of the string [[Governor]] preposition [[Complement]]. This is followed by the primary implicature for the pattern, essentially replacing the preposition with its definition. Associated with each pattern is a pattern number and the number of instances in each corpus that have been tagged with this pattern number.
Clicking on any pattern row opens the details for the pattern, with a pattern box entitled with the preposition and the pattern number. The pattern details provide descriptions of the complement and the governor, as written by a lexicographer, checkboxes identifying the basic syntactic characteristics for the complement and governor, and fields for recording selection criteria to recognize the pattern. A primary purpose of PDEP is to formalize these characterizations for use in natural language processing tasks (see below for procedures describing how the selectors are identified and encoded). The next two lines of the pattern detail give the semantics and cluster/semantic relations expressed by this pattern. The TPP class and TPP relation identify the characterizations developed in TPP. The Cluster identifies the general cluster assigned by Stephen Tratz (A Fast, Accurate, Non-Projective, Semantically-Enriched Parser). The Relation identifies the general semantic relation assigned by Vivek Srikumar (Modeling Semantic Relations Expressed by Prepositions). These two fields were initially completed only for prepositions used in the SemEval 2007 task on preposition disambiguation, and are now being extended to cover all prepositions.
The remaining rows provide further insights into the paradigmatic and syntagmatic characteristics of the pattern. The list of Substitutable Prepositions identifies prepositions that have senses similar to the one for this pattern. (Corpus instances of similar prepositions may provide useful information for further analysis.) The Syntactic Position identifies where in a clause the preposition in this pattern may appear, using the categories developed in Quirk et al. The Sense Relation identifies whether this pattern may be considered a core sense or a subsense of a core sense, in which case the type of relation is specified. Finally, the Primary Implicature is repeated and any comments about the pattern usage are specified.
In the menu bars for the pattern manager and for the pattern detail, there are drop-down boxes labeled All Corpus Instances and Corpus Instances. Selecting an option in either of these boxes will take you to the corpus instances associated with the preposition, either for the full set or for those that have been specifically tagged for a particular pattern. The options in the pattern manager refer to the full set of instances in the corpus for the preposition, regardless of the pattern or sense tag. The options in the pattern detail are for corpus instances that have been tagged with the specific pattern or sense. One option is for all instances in the TPP corpus that have not yet been tagged (identified as "TPPUNK"). Another option is for all instances in the TPP corpus that have been tagged with the sense 'x', which identifies instances that are not valid for the preposition, usually reflecting instances that have been mistagged by the trawl in developing the TPP corpus. Another option is for all instances in the TPP corpus that have been tagged with the sense 'pv', where the instance is a (transitive) phrasal verb that uses the preposition form, but is really part of the verb unit. These latter instances provide a basis for studying tagging and parsing difficulties for the preposition.
The selected set of instances opens in a new tab titled Annotation: preposition (sense). Each sentence is accompanied by the name of the corpus and instance identifying number, along with the current sense tag and the location of the preposition in the sentence. In the sentence itself, the preposition is given in bold, highlighted in light blue, and labeled as the target. The preposition object (or complement) is given in bold, highlighted in light green, and labeled as the complement. The preposition point of attachment (or governor) is given in bold, highlighted in light orange, and labeled as the governor. (Note that not all complements and governors are properly tagged and labeled, due to some underlying difficulties, such as an inability by the parser to identify these items.) The primary purpose of this tab is to facilitate sense-tagging of the TPP corpus instances. The menu bar identifies the preposition, the sense, and the corpus.
Tagging instances involves first selecting instances (clicking on individual sentences selects the sentence or clicking Select All selects all sentences, with each selected instance highlighted in yellow) and then selecting an option, i.e., a sense, from the Tag Instances drop-down list. (Clicking Unselect will remove all selections.) In addition to the full set of pattern numbers for the preposition, the options include x (to indicate that this instance is not a preposition), pv (to indicate that this instance is really a transitive phrasal verb, where the lemma should be tagged as a particle, and not a prepositional phrase), and unk (for unknown, i.e., not yet tagged). The Save option is for registered editors and is used to commit taggings to the database.
Steps and aids used in tagging instances are described below. In addition to making use of the pattern descriptions, features identified in parsing all instances can be examined and used as the basis for selecting instances automatically. These features characterize the context of a preposition's use and provide links to FrameNet frame elements associated with FrameNet lexical units.
In characterizing preposition behavior, the general semantic content of each element of [[Governor]] preposition [[Complement]] must be specified. We consider each component:
In analyzing preposition behavior, therefore, the objective is to tease apart these various elements. The procedures for doing so are laid out below.
In general, tagging TPP instances is based on considering the pattern descriptions in the pattern manager. Since the pattern sets (definitions) are based on the Oxford Dictionary of English, the likelihood is that the coverage and accuracy of the sense distinctions are quite high. However, since prepositions have generally not received the close attention of words in other parts of speech, PDEP is intended to ensure the coverage and accuracy. During the development of the SemEval 2007 tagged instances, using FrameNet sentences, the lexicographer found it necessary to increase the number of senses by about 10 percent. Since the lack of coverage in FrameNet is well-recognized, the representative sample developed for PDEP should provide the basis for ensuring the coverage and accuracy of the sense inventory.
As indicated, the first step in tagging instances involves looking at the patterns and seeing whether the TPP instances can be tagged with existing patterns. In addition to the patterns, instances that have been tagged for SemEval 2007 (labeled FN) or the Oxford English Corpus (labeled OEC) can be opened and used as the basis for making judgments on the TPP corpus.
We have provided tools to enhance the examination of similarities from the FN or OEC corpora and the application of the results to the TPP instances. As indicated, all sentences in the corpora have been fully parsed with a dependency parser. Features characterizing the context of the target preposition have also been developed for each sentence using Tratz's system. There are approximately 1500 features for each sentence; these data are almost instantly available for examination. When a particular corpus has been opened, whether for a particular sense or for the entire set, the menu bar includes an Examine item and a Select item. Next to the Examine item, there are two drop-down boxes, with the initial options labeled WFRs (word-finding rules) and FERs (feature extraction rules). To use the examine or select capability, a WFR and an FER need to be selected.
Word-finding rules enable examination of features for words in a certain contextual location with respect to the target preposition. They are divided into two sets: words pertaining to the governor and words pertaining to the complement. Words pertaining to the governor are: (1) verb or head to the left (l), (2) head to the left (hl), (3) verb to the left (vl), (4) word to the left (wl), and (5) governor (h). Words pertaining to the complement are: (1) syntactic preposition complement (c) and (2) heuristic preposition complement (hr). Thus, selecting one of these options identifies the word whose properties are to be examined.
Feature extraction rules identify the specific kind of feature to be examined. There are 9 feature kinds: (1) part of speech, using the Penn Treebank categories (pos), (2) word class, the 4 major word classes (wc), (3) lexical name, the WordNet file name category, 27 possibilities for nouns and 15 for verbs (ln), (4) lemma, the base form of a word (l), (5) the word as it appears (w), (6) synonyms, as identified in WordNet (s), (7) hypernyms, the first level in WordNet (h), (8) whether the word is capitalized (c), and (9) affixes present in the word, a set of 27 suffix or prefix characteristics (af). Thus, the feature extraction rules enable examination of specific syntactic or semantic features of the selected word.
The combination of the 7 WFRs and 9 FERs provides 63 features that can be examined for any corpus that is opened. When a WFR and an FER have been selected, clicking on Examine brings up a new tab with the results for that word/feature combination. The results are presented in a table with the headings Value, Count, and Description. Value gives the value of the feature. Count indicates the number of instances with this value. Description is given for only two features, the part of speech and the affixes, where the codes given in the value field are not always transparent. For the feature identifying whether a word is capitalized, the value is only 'true'. For most features, the number of possible values is relatively small, so the table is only several rows deep. For the lemma and the word itself, the number of distinct entries is limited by the number of instances in the particular corpus set being examined. For the synonym and hypernym features, the number of entries may be quite a bit larger.
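The arithmetic behind the 63 combinations, and the kind of tally that the Examine table reports, can be sketched as follows. The WFR and FER codes are those listed above; the 'wfr:fer' naming of a combination, the examine function, and the sample instances are assumptions made for illustration and do not reproduce the PDEP code or data.

```python
from collections import Counter
from itertools import product

# WFR and FER codes as listed in the text above.
WFRS = ["l", "hl", "vl", "wl", "h", "c", "hr"]
FERS = ["pos", "wc", "ln", "l", "w", "s", "h", "c", "af"]

# Every WFR/FER pair is one examinable feature: 7 x 9 = 63.
FEATURES = [f"{wfr}:{fer}" for wfr, fer in product(WFRS, FERS)]


def examine(instances, wfr, fer):
    """Tally how many instances carry each value of the chosen feature."""
    key = f"{wfr}:{fer}"
    counts = Counter()
    for features in instances:
        # A set ensures an instance is counted once per distinct value.
        counts.update(set(features.get(key, [])))
    return counts.most_common()


# Hypothetical parsed instances: feature name -> list of values.
SAMPLE = [
    {"hr:ln": ["noun.time"], "h:pos": ["VBD"]},
    {"hr:ln": ["noun.time", "noun.event"], "h:pos": ["VBD"]},
    {"hr:ln": ["noun.location"], "h:pos": ["NN"]},
]

print(len(FEATURES))
for value, count in examine(SAMPLE, "hr", "ln"):
    print(value, count)
```

Because a word may have several WordNet lexical names, synonyms, or hypernyms, a single instance can contribute to more than one row of such a table.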
In addition to the features that have been developed through parsing the sentences in a corpus, an additional capability allows examination of potential semantic role labels using FrameNet data associated with lexical units (as annotated in the FrameNet project). Next to the drop-down boxes for specifying WFRs and FERs, there is a checkbox labeled FN when the given preposition has been used for marking a frame element. When frames are developed and sentences containing lexical units for the frame are annotated, a set of frame element realizations is recorded in summary form. Many of these realizations are in the form PP[prep]. We have created a dictionary of the FrameNet lexical units that contains a list of all frame element realizations associated with the lexical unit. Throughout the FrameNet data, 75 distinct prepositions are recorded along with the frame element. When the FN box is checked, for a particular corpus of a preposition, the set of lexical units with that preposition is retrieved. We hypothesize that the governor of a prepositional phrase is the trigger for this phrase. To examine the occurrences of a possible frame element governed by one of these triggers, we need to select the governor WFR (h) and the lemma FER (l). With this combination and with the FN box checked, clicking on Examine will generate a table of all governors (in the lemma form, i.e., lexical units) in the current corpus that have been tagged in FrameNet. In addition to the count of instances, the results also identify, under the Description heading, the set of frame elements that have been assigned to these prepositional phrases in FrameNet. In many cases, more than one frame element has been tagged with the given lexical unit. For example, some sentences for the lexical unit dance have been tagged for the preposition 'across' with the Area or the Path frame element.
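The lookup described above can be sketched under stated assumptions. The dictionary layout (lemma mapped to a list of frame element and realization pairs), the function names, and every entry other than the 'dance'/'across' pairing with Area and Path are invented for illustration; they are not the actual PDEP data structures.

```python
import re
from collections import Counter

# Hypothetical excerpt of the lexical-unit dictionary:
# lemma -> list of (frame element, realization).
# Only 'dance' with PP[across] as Area or Path comes from the text above;
# the remaining entries are made up.
FN_REALIZATIONS = {
    "dance": [("Area", "PP[across]"), ("Path", "PP[across]"), ("Manner", "AVP")],
    "stroll": [("Path", "PP[along]")],
}

# Realizations of interest have the form PP[prep].
PP_PATTERN = re.compile(r"^PP\[(.+)\]$")


def frame_elements(lemma, preposition):
    """Frame elements realized as PP[preposition] for a lexical unit."""
    found = set()
    for element, realization in FN_REALIZATIONS.get(lemma, []):
        match = PP_PATTERN.match(realization)
        if match and match.group(1) == preposition:
            found.add(element)
    return sorted(found)


def examine_fn(governor_lemmas, preposition):
    """Rows of (lemma, count, frame elements) for governors found in the dictionary."""
    rows = []
    for lemma, count in Counter(governor_lemmas).most_common():
        elements = frame_elements(lemma, preposition)
        if elements:
            rows.append((lemma, count, elements))
    return rows


# Governor lemmas (WFR h, FER l) from a hypothetical corpus for 'across'.
print(examine_fn(["dance", "run", "dance", "stroll"], "across"))
# prints [('dance', 2, ['Area', 'Path'])]
```

Governors that are absent from the dictionary, or that are not recorded with the given preposition, simply produce no row, which mirrors the observation that the hits depend on the coverage of the resource.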
A similar capability has been added to examine prepositions identified in VerbNet. Throughout the VerbNet data, 31 distinct prepositions have been identified in VerbNet frames. Again, with the selection of the governor WFR (h) and the lemma FER (l), and with the VN box checked, clicking on Examine will generate a table of all governors (in the lemma form, i.e., members of VerbNet verb classes) in the current corpus that have been identified in VerbNet frames. In addition to identifying the lemmas, the results also identify the VerbNet classes. In some cases, a lemma may appear as a member of more than one verb class using the given preposition.
The general objective of examining features is to identify those that are diagnostic of specific senses. To do this most effectively, it is best to open the corpus instances that have been tagged with a specific sense in either FN or OEC (see the instructions above for Preposition Corpus Instances). Experience in examining features will identify the most useful combinations. When an interesting feature has been identified, it can be used to select sentences in the open corpus set. To do this, put the value identified in a feature examination in the box next to Select and then click on Select (or just press the Enter key after entering text in this field). When this is done on an FN or OEC corpus, particularly those for specific senses, the selected instances will generally show the consistency with which these instances have been tagged. When the same feature combination is used with the TPP corpus, particularly for instances not yet tagged, the selection will identify candidate instances for tagging with a specific sense. For example, opening the full TPP corpus for 'over', specifying 'hr' as the WFR and 'ln' as the FER, and then placing 'noun.time' in the selection box will identify 122 instances out of 500 that have this characteristic. Inspection will show how well this combination is diagnostic of sense 14(5) of 'over'.
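The Select step amounts to filtering the open corpus by one feature value. The following is a minimal sketch of that idea; the instance layout, the identifiers, and the function name select are hypothetical, and the sample is far smaller than the 500 instances mentioned in the example.

```python
# Hypothetical instances: an identifier plus feature name -> list of values.
INSTANCES = [
    {"id": 1, "features": {"hr:ln": ["noun.time"]}},
    {"id": 2, "features": {"hr:ln": ["noun.location", "noun.artifact"]}},
    {"id": 3, "features": {"hr:ln": ["noun.time", "noun.event"]}},
    {"id": 4, "features": {}},  # e.g. the parser found no complement
]


def select(instances, wfr, fer, value):
    """Return the ids of instances whose chosen feature includes `value`."""
    key = f"{wfr}:{fer}"
    return [
        inst["id"]
        for inst in instances
        if value in inst["features"].get(key, [])
    ]


selected = select(INSTANCES, "hr", "ln", "noun.time")
print(f"{len(selected)} of {len(INSTANCES)} instances selected: {selected}")
# prints 2 of 4 instances selected: [1, 3]
```

The selected instances are only candidates: as the text notes, inspection is still needed to judge how diagnostic the feature is of the sense.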
By examining features, the behavior of a particular sense can be constructed. As indicated above, examining characteristics of the two tagged corpora (OEC and FN) will be useful in formalizing the TPP data in the pattern box. This may begin with an examination of the word classes (wc) and parts of speech (pos) of the complements and governors. These can be used to check the appropriate boxes in the pattern description (NN, NNP, WH, or -ING for the complements and Noun, Verb, or Adj for the governors).
A next step might be to examine the complement and governor lemmas (l) and words (w). It is likely that several words or lemmas will be identified. Several potential categorizations of these words can be examined, including WordNet lexical names (ln), WordNet synonyms (s), WordNet hypernyms (h), FrameNet frame element realizations (with FN checked), and VerbNet verb classes (with VN checked). When these features are examined, the results show the number of instances in the particular subcorpus and the total number of instances in that corpus, so that some assessment of generality can be made. The WordNet features tend to produce a larger number of total hits, reflecting the polysemy present in WordNet. The numbers of FrameNet and VerbNet hits are always below the total number of instances; this reflects the coverage of these two resources.
When some features appear to be diagnostic of a sense, the specifications can be applied to the TPP corpus using the Select facility. When the selected instances appear to have been selected appropriately, they can then be tagged with the particular sense under investigation. In such cases, the selection criteria are entered into the Selector fields of the patterns. For example, for pattern 12(10) of 'for', indicating the length of (a period of time), the WordNet lexical name noun.time is found to be quite prevalent in the OEC and FN corpora for this sense. When applied to the TPP corpus, most selected instances appear to be correctly identified. Upon examination, any incorrect selections can be unselected. The sense 12(10) is then applied to the selected instances. Finally, the annotation hr:ln:noun.time is entered into the Selector field for the complement.
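An annotation such as hr:ln:noun.time packs a WFR, an FER, and a value into one string. The following sketch shows one way such a string could be read back for use in NLP tasks. It assumes the three parts are separated by colons in that order; the Selector type and the parse_selector function are hypothetical and are not part of PDEP.

```python
from typing import NamedTuple

# WFR and FER codes as listed earlier in this guide.
WFRS = {"l", "hl", "vl", "wl", "h", "c", "hr"}
FERS = {"pos", "wc", "ln", "l", "w", "s", "h", "c", "af"}


class Selector(NamedTuple):
    wfr: str    # which word to look at, e.g. 'hr' (heuristic complement)
    fer: str    # which feature of that word, e.g. 'ln' (WordNet lexical name)
    value: str  # the feature value, e.g. 'noun.time'


def parse_selector(text):
    """Split 'wfr:fer:value' into its parts, checking the two codes."""
    # Split at most twice so that a value containing ':' stays intact.
    parts = text.strip().split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected wfr:fer:value, got {text!r}")
    wfr, fer, value = parts
    if wfr not in WFRS:
        raise ValueError(f"unknown WFR code {wfr!r}")
    if fer not in FERS:
        raise ValueError(f"unknown FER code {fer!r}")
    return Selector(wfr, fer, value)


print(parse_selector("hr:ln:noun.time"))
# prints Selector(wfr='hr', fer='ln', value='noun.time')
```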
Once instances in TPP have been tagged for a specific sense, the next time this sense is examined, these instances can then be investigated in further depth. It is much easier to examine the consistency of the tagging when only the instances with these tags are shown. Further shades of meaning can perhaps be identified, perhaps with further refinement of all fields in the pattern description.
It is worth noting that examination of WordNet, FrameNet, and VerbNet features may provide additional insights into those resources. The WordNet features frequently reveal unexpected characterizations (such as 'school' as a time period). For FrameNet, the FN corpus shows a very high number of hits for FrameNet head lemmas, while the OEC and TPP corpora show a much lower number of hits. VerbNet also has a much smaller number of hits. Thus, presuming that the identification of head lemmas is quite accurate, analysis of the TPP instances may provide an opportunity for expanding the coverage of FrameNet and VerbNet.
PDEP enables an in-depth analysis of TPP classes, Tratz clusters, and Srikumar semantic relations. First, we query the database underlying the patterns to identify all senses with a particular class. We then examine each sense on each list in detail. We follow the procedures laid out above for examining the features to add information about selectors, complement types, and categories. We use this information to tag the TPP instances conservatively, e.g., leaving questionable instances untagged. Finally, we carefully place each sense into a preposition class or subclass, grouping senses together and making annotations that attempt to capture any nuance of meaning that distinguishes the sense from other members of the class.
To build a description of the class and its subclasses, we make use of the Quirk reference in the pattern box (i.e., the relevant discussions in Quirk et al. (1985)). We build the description of a class as a separate web page and make this available as a menu item in the pattern box, labeled Analysis. A class analysis is not yet available for all classes; the current state of class analysis is described in Preposition Class Analyses. The description provides an overview of the class, making use of the TPP data and the Quirk discussion, and indicating the number of senses and the number of prepositions. Next, the description provides a list of the categories within the class, characterizing the complements of the category and then listing each sense in the category, with any nuance of meaning as necessary. Finally, we attempt to summarize the selection criteria that have been used across all the senses in the class. A list of preposition senses in each class and their semantic relation type (Srtype) is also provided, along with a count of the number of instances tagged with each sense, the percentage of instances for the preposition that have been tagged with each sense, and a normalized frequency of the occurrence of each sense in the British National Corpus (per million prepositions).
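The text does not give the formula for the normalized frequency, so the following is only one plausible reading: the share of tagged instances carrying a sense is scaled by the preposition's BNC frequency and expressed per million preposition occurrences. The function name, the sense labels, and all numbers are hypothetical.

```python
def sense_statistics(tagged_counts, preposition_bnc_freq, total_bnc_freq):
    """Per-sense count, percentage, and assumed per-million frequency.

    tagged_counts: sense label -> number of tagged instances for one preposition
    preposition_bnc_freq: BNC frequency of that preposition
    total_bnc_freq: estimated BNC frequency of all prepositions
    """
    total_tagged = sum(tagged_counts.values())
    if total_tagged == 0:
        raise ValueError("no tagged instances")
    stats = {}
    for sense, count in tagged_counts.items():
        share = count / total_tagged
        stats[sense] = {
            "count": count,
            "percent": 100.0 * share,
            # Assumption: the sample share extrapolates to the whole BNC.
            "per_million": 1_000_000 * share * preposition_bnc_freq / total_bnc_freq,
        }
    return stats


# Made-up figures: 200 tagged instances of a preposition that occurs
# 10,000 times among 1,000,000 preposition occurrences.
for sense, row in sense_statistics({"1(1)": 50, "2(1a)": 150}, 10_000, 1_000_000).items():
    print(sense, row)
```

This reading treats the tagged sample as representative of the preposition's use in the BNC; if the sample is skewed, the per-million figures would be skewed in the same way.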
The process of building a class description reveals inconsistencies in each of the class fields. When we place a preposition sense into the class, we may find it necessary to make changes in the underlying data. At the top level, these class analyses in effect constitute a coarse-grained sense inventory. As the subclasses are developed, a finer-grained analysis of a particular area is available. We believe these analyses may provide a comprehensive characterization of particular semantic roles that can be used for various NLP applications.
PDEP is a work in progress, with several questions being addressed, including the following: