Latest Publications

Syntagmatic Patterns for Prepositions

The Pattern Dictionary of English Prepositions (PDEP) is modeled on the Pattern Dictionary of English Verbs (PDEV). PDEV is based on Corpus Pattern Analysis (CPA), a procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations. The focus of the analysis is on the prototypical syntagmatic patterns with which words in use are associated. In identifying such patterns, the goal is to delineate the relationship between two or more linguistic units that combine to form well-formed structures. In the case of prepositions, this means examining words governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause.

The characterization of preposition behavior starts with the general syntagmatic pattern [[Governor]] preposition [[Complement]]. Each element of this pattern must be specified. We consider each component:

  • [[Complement]]: Syntactically, the complement is a noun phrase, a nominal wh-clause, or a nominal –ing clause. Considered by itself, the complement has a meaning, i.e., some ontological category. For example, Boston is a city. This category may frequently help in disambiguating the preposition. However, more generally, some additional meaning is given to the complement. For example, Boston may be a destination or a point of reference. The precise meaning will come from the preposition and the governor.
  • preposition: The preposition associated with the complement provides a first step in determining what additional meaning should be added to the complement. In general, a given complement can appear after a large number of prepositions. For the example of Boston, we can imagine sentences using the following prepositions: across, against, around, beyond, from, in, into, of, over, through, to, and within. Other prepositions, such as between, by reason of, during, and until, are unlikely to have Boston as a complement. The specific preposition thus imparts some information about how we should interpret the complement.
  • [[Governor]]: The final piece of meaning associated with the complement is provided by the governor, or point of attachment, of the prepositional phrase. For the example of Boston, the verb played, as in played against Boston, will invoke a sports context, while resided, as in resided in Boston, will invoke a locational sense.
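The three-part pattern can be made concrete as a small data structure. The following Python sketch is purely illustrative: the semantic classes and implicatures below are made up for this example, and PDEP itself involves no such code. It records two hypothetical patterns for Boston-type complements and retrieves candidates by preposition:

```python
from dataclasses import dataclass

@dataclass
class PrepositionPattern:
    """One syntagmatic pattern: [[Governor]] preposition [[Complement]]."""
    preposition: str       # the preposition itself
    governor_class: str    # semantic class of the governing word
    complement_class: str  # ontological category of the complement
    implicature: str       # first-approximation "meaning" of the pattern

# Two hypothetical patterns for phrases with "Boston" as complement
patterns = [
    PrepositionPattern("against", "sports activity", "team/opponent",
                       "in opposition to or competition with"),
    PrepositionPattern("in", "residing/locating", "location",
                       "within the bounds of a place"),
]

def candidate_patterns(prep, patterns):
    """Return the patterns whose preposition matches the one observed."""
    return [p for p in patterns if p.preposition == prep]

print([p.implicature for p in candidate_patterns("in", patterns)])
```

Disambiguation then amounts to choosing among the candidates by checking the governor and complement classes against the observed context.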

In CPA, no attempt is made to identify the meaning of a preposition directly, as a word in isolation. Instead, meanings are associated with prototypical sentence contexts. Concordance lines are grouped into semantically motivated syntagmatic patterns. Associating a “meaning” with each pattern is a secondary step, carried out in close coordination with the assignment of concordance lines to patterns. In CPA, this meaning is expressed as a set of basic implicatures. As a first approximation, the implicature is a definition that might be found in a dictionary. In PDEP, the initial implicature (or definition) was taken from a dictionary and has been used as a first approximation for grouping concordance lines. However, since prepositions in general have not received attention as close as that given to nouns and verbs, these definitions need to be viewed with a great deal of care.

In general, a preposition does not have meaning by itself. Instead, the meaning is conveyed by the totality of the pattern, and is distributed across the three components. For some prepositions, the bulk of the meaning is conveyed by the complement; with others, the bulk of the meaning is conveyed by the governor. There is a sliding scale of the contribution of each component; an interesting question is whether the relative contributions can be quantified in some way. Specification of a pattern will thus involve circumscribing the components in as much detail as is appropriate. For example, for about (1(1)), “on the subject of; concerning”, the complement may be specified as [[Anything]]; the governor emphasizes abstractions, communication, and mental features (feeling and idea).

Fulfilling the Firthian Maxim

J. R. Firth’s famous quotation, “You shall know a word by the company it keeps,” is often cited as the beginning of corpus linguistics, the study of language as expressed in samples. This approach contributed greatly to the growth of English lexicography. In 1990, the advent of computerized samples (corpora) brought about the emergence of a statistical approach to word behavior in computational linguistics, with the paper by Church & Hanks on word association norms and mutual information. As corpora have grown, so too has their analysis, particularly with word sketches, which provide a corpus-derived summary of a word’s grammatical and collocational behavior. Statistical characterizations of a word’s behavior have found many uses, but here we want to focus on their use in lexicography. Word sketches have been used by lexicographers in developing definitions for dictionaries. Increasingly, lexicographers also keep a record of the sentences they use as the basis for each definition, i.e., the company that the word keeps. Such sentences can be viewed as sense-disambiguated, at least with respect to the sense inventory that has been developed. With many sense inventories and their respective corpus instances, there is an opportunity to test the consistency with which humans have classified the instances. Such consistency checking can be done both internally and across different resources. The emphasis of the consistency checking is on the “You” in Firth’s maxim. We explore how this can be accomplished.
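The word association measure introduced by Church & Hanks can be illustrated in a few lines of code. The sketch below computes pointwise mutual information over a toy corpus; the fixed forward window used for co-occurrence counting is an assumption made here for brevity, not a claim about their exact counting scheme:

```python
import math
from collections import Counter

def pmi(corpus_tokens, w1, w2, window=2):
    """Pointwise mutual information for a word pair, in the spirit of
    Church & Hanks (1990): log2( P(w1, w2) / (P(w1) * P(w2)) ).
    Co-occurrences are counted when w2 appears within `window` tokens
    after w1 (a simplifying assumption for this illustration)."""
    n = len(corpus_tokens)
    unigrams = Counter(corpus_tokens)
    pair = 0
    for i, tok in enumerate(corpus_tokens):
        if tok == w1:
            context = corpus_tokens[i + 1 : i + 1 + window]
            pair += context.count(w2)
    if pair == 0:
        return float("-inf")  # never co-occur: maximally dissociated
    p_pair = pair / n
    p1 = unigrams[w1] / n
    p2 = unigrams[w2] / n
    return math.log2(p_pair / (p1 * p2))

toy = "strong tea and strong coffee but powerful computer".split()
print(pmi(toy, "strong", "tea"))  # 2.0 on this toy corpus
```

High PMI values flag pairs that keep each other's company far more often than chance predicts, which is exactly the raw material from which word sketches are distilled.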


Examining the Twitterverse with Content Analysis: A First Look

Analyzing Twitter data is becoming increasingly popular. Within the computational linguistics community, tweets are particularly challenging and interesting. The limitation of 140 characters would seem to make tasks easier, since sentences would be relatively short (e.g., compared to long sentences in newspaper articles). However, this limitation has brought with it some rather fundamental changes in the way we communicate, primarily in the lexicon, with novel creations (e.g., “l8” for “late”). In addition, tweets are full of non-standard uses of punctuation marks, particularly in creating emoticons, further complicating analysis. A recent paper by Kyle Dent and Sharoda Paul, “Through the Twitter Glass: Detecting Questions in Micro-text”, took on the natural language processing (NLP) challenges (described briefly in a Scientific American article), developing NLP techniques to deal specifically with issues in tokenization, the lexicon, and parsing. They built a system to classify 2304 tweets into “real” questions and “not” questions (tweets with a superficial resemblance to questions). Tweets share a property with Likert scale items, namely, that both are short. The content analysis program MCCA (Minnesota Contextual Content Analysis) has been applied to an examination of Likert items in an attempt to improve the coherence of an entire scale. I modified MCCA slightly so that it would perform a classification task, applied it to the Twitter data used by Dent & Paul, and achieved results almost as good, without having to deal with all the NLP issues. This would suggest that MCCA can provide an initial classification tool as a first step in the analysis of Twitter data. The MCCA analysis also showed that the tweets in this data set are extremely emotional, anti-practical, and anti-analytic.
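To make the classification task concrete, here is a deliberately naive surface-cue baseline for separating “real” questions from question-like tweets. This is an illustrative sketch only; it is not MCCA and not the Dent & Paul system, and the cue lists are invented for the example:

```python
import re

def looks_like_question(tweet):
    """Naive baseline: a tweet is a "real" question if it ends in a
    single "?" and opens with an interrogative word, and shows no
    expressive markers (repeated punctuation, "lol", "omg") that often
    signal a rhetorical non-question. Cues are illustrative guesses."""
    text = tweet.strip().lower()
    # Expressive markers: treat as rhetorical, hence not a real question
    if re.search(r"(!{2,}|\?{2,}|\blol\b|\bomg\b)", text):
        return False
    # Interrogative opener plus a terminal "?" is the strongest cue
    openers = ("who", "what", "when", "where", "why", "how",
               "is", "are", "do", "does", "did", "can", "could",
               "will", "would", "should")
    return text.endswith("?") and text.split()[0] in openers

print(looks_like_question("where can i find a good pizza place?"))  # True
print(looks_like_question("omg why me???"))                          # False
```

A baseline this crude obviously misses much, but it shows why the task is non-trivial: the same surface punctuation appears in both classes, and the distinguishing signal is partly lexical and partly contextual.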

Enhanced Word Sketches

Recently, I made a request on the ACL SIGLEX mailing list for tools that might help in analyzing preposition lexical samples. In this request, I indicated a need for software that would specifically provide enhanced word sketch analysis. I received only a couple of replies, one of which asked what I meant by this term. I responded somewhat vaguely, but the interchange sparked some thoughts that are worth exploring further. In particular, this discussion raised questions about the amount of information in preposition dictionary entries and what might help in expanding these entries. I’d like to expand on this, particularly on the relation between current approaches to word-sense disambiguation (primarily statistical in nature) and what ends up in the dictionary. I think there is still something of a disconnect between the computational community and the lexicographers.