The Clog

Syntagmatic Patterns for Prepositions

admin — Fri, 01 Jul 2016 02:14:55 +0000

The Pattern Dictionary of English Prepositions (PDEP) is modeled on the Pattern Dictionary of English Verbs (PDEV). PDEV is based on Corpus Pattern Analysis (CPA), a procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations. The focus of the analysis is on the prototypical syntagmatic patterns with which words in use are associated. In identifying such patterns, the goal is to delineate the relationship between two or more linguistic units to make well-formed structures. In the case of prepositions, this means examining words governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause.

The characterization of preposition behavior starts with the general syntagmatic pattern [[Governor]] preposition [[Complement]]. Each element of this pattern must be specified. We consider each component:

[[Complement]]: Syntactically, the complement is a noun phrase, a nominal wh-clause, or a nominal –ing clause. Considered by itself, the complement has a meaning, i.e., some ontological category. For example, Boston is a city. This category may frequently help in disambiguating the preposition. However, more generally, some additional meaning is given to the complement. For example, Boston may be a destination or a point of reference. The precise meaning will come from the preposition and the governor.
preposition: The preposition associated with the complement provides a first step in allowing us to determine what additional meaning should be added to the complement. In general, a given complement can appear after a large number of prepositions. For the example of Boston, we can imagine sentences using the following prepositions: across, against, around, beyond, from, in, into, of, over, through, to, and within. Other prepositions, such as between, by reason of, during, and until, are unlikely to have Boston as a complement. The specific preposition will impart some information on how we want to interpret the complement.
[[Governor]]: The final piece of meaning associated with the complement is provided by the governor, or the point of attachment, of the prepositional phrase. For the example of Boston, the verb “played” with “against Boston” will invoke a sports context, while “resided” with “in Boston” will invoke a locational sense.
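
As a rough illustration of the three-slot pattern, here is a minimal Python sketch. The class, the semantic type labels ("SportsContest", "Residing", "City"), and the glosses are invented for this example; they are not PDEP's actual inventory or format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrepositionPattern:
    """One [[Governor]] preposition [[Complement]] pattern (illustrative only)."""
    governor_type: str    # hypothetical label for the point of attachment
    preposition: str
    complement_type: str  # hypothetical ontological category of the complement
    implicature: str      # invented gloss for the pattern as a whole


# Toy patterns built from the Boston examples; labels and glosses are assumptions.
PATTERNS = [
    PrepositionPattern("SportsContest", "against", "City", "opponent in a game"),
    PrepositionPattern("Residing", "in", "City", "place where someone lives"),
]


def match(governor_type, preposition, complement_type):
    """Return the glosses of every pattern whose three slots all agree."""
    wanted = (governor_type, preposition, complement_type)
    return [
        p.implicature
        for p in PATTERNS
        if (p.governor_type, p.preposition, p.complement_type) == wanted
    ]


print(match("SportsContest", "against", "City"))
print(match("Residing", "in", "City"))
```

The point of the sketch is only that the same complement type yields different readings once the preposition and governor are fixed; no single slot decides the meaning.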

In CPA, no attempt is made to identify the meaning of a preposition directly, as a word in isolation. Instead, meanings are associated with prototypical sentence contexts. Concordance lines are grouped into semantically motivated syntagmatic patterns. Associating a “meaning” with each pattern is a secondary step, carried out in close coordination with the assignment of concordance lines to patterns. In CPA, this meaning is expressed as a set of basic implicatures. As a first approximation, the implicature is a definition that might be found in a dictionary. In PDEP, the initial implicature (or definition) was taken from a dictionary and has been used as a first approximation for grouping concordance lines. However, since prepositions in general have not received as close attention as nouns and verbs, these definitions need to be viewed with a great deal of care.

In general, a preposition does not have meaning by itself. Instead, the meaning is conveyed by the totality of the pattern, and is distributed across the three components. For some prepositions, the bulk of the meaning is conveyed by the complement; with others, the bulk of the meaning is conveyed by the governor. There is a sliding scale of the contribution of each component; an interesting question is whether the relative contributions can be quantified in some way. Specification of a pattern will thus involve circumscribing the components in as much detail as is appropriate. For example, for about (1(1)), “on the subject of; concerning”, the complement may be specified as [[Anything]]; the governor emphasizes abstractions, communication, and mental features (feeling and idea).

Fulfilling the Firthian Maxim

Ken — Wed, 30 May 2012 20:00:34 +0000

J. R. Firth’s famous quotation, “You shall know a word by the company it keeps,” is cited as the beginning of corpus linguistics, the study of language as expressed in samples. This approach had great success in the growth of English lexicography. In 1990, the advent of computerized samples (corpora) brought about the emergence of a statistical approach to word behavior in computational linguistics, with the paper by Church & Hanks on word association norms and mutual information. As corpora have grown, so too has their analysis, particularly with word sketches, which provide a corpus-derived summary of a word’s grammatical and collocational behavior. Statistical characterizations of a word’s behavior have found many uses, but here we want to focus on their use in lexicography. Word sketches have been used by lexicographers in developing definitions for dictionaries. Increasingly, they also keep a record of the sentences they use as the basis for each definition, i.e., the company that the word keeps. Such sentences can be viewed as sense-disambiguated, at least with respect to the sense inventory that has been developed. With many sense inventories and their respective corpus instances, there is an opportunity for testing the consistency with which humans have classified the instances. Such consistency checking can be done both internally and across different resources. The emphasis of the consistency checking is on the “You” in Firth’s maxim. We explore how this can be accomplished.

A significant open problem in computational linguistics is word-sense disambiguation (WSD). As currently formulated, this task is the problem of assigning senses to a corpus of instances, given a sense inventory. Difficulties noted in the literature include differences among dictionary sense inventories, the granularity of the senses, and disagreements among annotators in making sense assignments. What we want to do is to address these difficulties in a more systematic fashion by turning the WSD task around. Specifically, we want to examine the corpus instance classifications to determine the extent to which the lexicographers have been consistent in characterizing the company a word keeps.

To perform this task, we use the following steps:

Extracting the instances from a resource’s data for each sense, keeping track of whatever properties the resource uses to characterize a sense and collecting all the examples that have been associated with the sense,
Tagging the instances to obtain a set of part-of-speech tags and the lemmas for each token, doing so with a single tagger so that results across resources can be compared,
Analyzing the tags to locate the target word and to identify other phrases that stand in particular syntactic structure to the target (and identifying the heads of such phrases), and
Comparing two resources to determine the extent to which the profile of the analyses for each sense corresponds to each other, over all senses.

The last two steps form the essence of satisfying the Firthian maxim, characterizing the company a word keeps. Clearly, both steps can become quite involved. We have only scratched the surface in our development of appropriate methods, hopefully at least providing a proof of concept that this approach will be useful.

Our analysis has involved several prominent lexical resources: the Oxford Dictionary of English (ODE) sentence dictionary, the Pattern Dictionary of English Verbs (PDEV), the Dictionary of Analysed Texts of English (DANTE), FrameNet (FN), and WordNet (using SemCor). We have only examined one word, abandon, in its verb senses, and have thus far only used one criterion, its object, as the basis for the tag analysis and the resource comparison. These resources have the following properties:

ODE: 7 senses, 118 sentences
PDEV: 7 senses, 228 sentences
DANTE: 9 senses, 50 sentences
FN: 3 senses, 20 sentences
WN: 5 senses, 19 sentences

Each sense in each resource is identified as having a noun phrase object. However, the object is not always immediate, since the verb in the corpus instances is frequently in the passive voice (where the surface subject is actually the object of the verb) or used as a past participle modifying a noun (taken to be the object). The tag analysis attempts to find these objects, and in our preliminary implementation, succeeded in about half the cases. We used the lemma corresponding to the head of the noun phrase as the basis for comparing the five resources. The comparison looked at two resources at a time, arraying the senses of each with the other, counting the number of heads in common for each cell in the matrix.
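
The pairwise comparison can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the function name and the toy head lemmas are invented, and the sketch counts distinct lemmas (types) rather than instances.

```python
def head_overlap_matrix(resource_a, resource_b):
    """Count the object-head lemmas shared by each pair of senses.

    Each resource maps a sense label to the list of head lemmas found in
    that sense's corpus instances. Returns matrix[sense_a][sense_b].
    """
    matrix = {}
    for sense_a, heads_a in resource_a.items():
        row = {}
        for sense_b, heads_b in resource_b.items():
            # Distinct lemmas in common between the two senses.
            row[sense_b] = len(set(heads_a) & set(heads_b))
        matrix[sense_a] = row
    return matrix


# Invented toy data; not taken from any of the five resources.
res_a = {"A1": ["plan", "project", "attempt"], "A2": ["child", "car", "ship"]}
res_b = {"B1": ["plan", "attempt"], "B2": ["ship"], "B3": ["car", "child"]}

for sense, row in head_overlap_matrix(res_a, res_b).items():
    print(sense, row)
```

In the toy data, the heads of sense A2 spread over B2 and B3, which is the kind of cell pattern that would suggest one inventory has split a sense that the other has lumped.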

Our analysis yielded several observations:

Lexicographically well-drawn sense inventories (ODE, PDEV, and DANTE) were generally consistent with each other, with common heads intersecting in a single sense.
When sense inventories differing in size were mapped (e.g., PDEV and DANTE), common heads intersected in more than one sense, suggesting that the inventory with the larger number of senses had “split” senses that had been “lumped” by the other inventory (i.e., addressing the issue of sense granularity).
Some mapping revealed inconsistencies. For example, in FN, corpus instances including “abandon a project” appeared under multiple senses, suggesting a violation of the Firthian maxim.
Some tag analysis showed internal inconsistency within a single sense inventory, e.g., “abandon a plan” appeared under two senses in PDEV.
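
A check for the internal inconsistency described in the last observation might look like the following minimal Python sketch. The function name and the toy inventory are invented for illustration.

```python
from collections import defaultdict


def heads_in_multiple_senses(resource):
    """Return each head lemma that occurs under more than one sense."""
    senses_of = defaultdict(set)
    for sense, heads in resource.items():
        for head in heads:
            senses_of[head].add(sense)
    return {h: sorted(s) for h, s in senses_of.items() if len(s) > 1}


# Invented toy inventory: sense label -> object-head lemmas of its instances.
toy = {"S1": ["plan", "project"], "S2": ["plan", "child"], "S3": ["ship"]}
print(heads_in_multiple_senses(toy))
```

A head flagged this way is only a candidate for review, since the same noun can legitimately be the object of different senses.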

These observations are only preliminary, and clearly need further substantiation with more words and with more tag analysis. However, they are intriguing, since they are supportive of the Firthian maxim. They also support basic lexicographic notions of lumping and splitting. Finally, the observations give some understanding of the potential source of difficulties in WSD.

Examining the Twitterverse with Content Analysis: A First Look

Ken — Tue, 27 Sep 2011 20:38:26 +0000

Analyzing Twitter data is becoming increasingly popular. Within the computational linguistics community, tweets are particularly challenging and interesting. The limitation of 140 characters would seem to make tasks easier, since sentences would be relatively short (e.g., compared to long sentences in newspaper articles). However, this limitation has brought with it some rather fundamental changes in the way we communicate, primarily in the lexicon, with novel creations (e.g., “l8” for “late”). In addition, tweets are full of non-standard use of punctuation marks, particularly in creating emoticons, further complicating analysis. A recent paper by Kyle Dent and Sharoda Paul, “Through the Twitter Glass: Detecting Questions in Micro-text”, took on the natural language processing (NLP) challenges (described briefly in a Scientific American article), developing NLP techniques to deal specifically with issues in tokenization, the lexicon, and parsing. They built a system to classify 2304 tweets into “real” questions and “not” questions (which had a superficial resemblance to questions). Tweets share a property with Likert scales, namely, that they are both short. The content analysis program MCCA (Minnesota Contextual Content Analysis) has been applied to an examination of Likert items in an attempt to improve the coherence of an entire scale. I modified MCCA slightly so that it would perform a classification task, applied it to the Twitter data used by Dent & Paul, and achieved results almost as good, without having to deal with all the NLP issues. This would suggest that MCCA can provide an initial classification tool as a first step in the analysis of Twitter data. The MCCA analysis also showed that the tweets in this data set are extremely emotional, anti-practical, and anti-analytic.

MCCA is a content analysis program designed to characterize texts based on the relative frequency with which words in categories are used, compared to norms determined from general usage statistics for the English language. It has been used in over 1500 studies since the early 1970s, primarily using a mainframe program at the University of Minnesota (with statistics on results used to determine any trends in general usage). The texts can range in size from short answers given to open-ended questions in questionnaires to newspaper articles, books, and multi-person transcripts such as focus groups or plays. MCCA takes about 4 seconds to analyze the 30,000 words in Hamlet. There are two primary sets of statistics used to characterize texts: (1) emphasis scores, showing the relative frequency of words in 116 categories (such as Feeling, Quantities, Spatial Sense, and Human Roles) and (2) context scores, profiling texts along four social context dimensions. The dimensions are traditional (judicial or religious texts), practical (newspaper articles, goal-oriented how-to texts), emotional (focus on personal involvement, such as leisure or recreation), and analytic (objective, research-oriented texts). The contextual analysis is a distinguishing characteristic of MCCA. The four contexts were determined from a principal components analysis of texts, and each has a set of weights for each of the 116 emphasis categories. Typically, multiple texts are processed together to permit a comparison among them.

Underlying MCCA is a dictionary of 11,000 words, each of which has been assigned one or more categories (i.e., allowing ambiguity). As a text is processed, words with multiple categories are disambiguated using a running context score (so that the selected category is closest to the running context). The various statistics characterizing texts are essentially based on the words that can be categorized. Unknown words are relegated to a “leftover” category and generally do not participate in the analyses. Two principal statistics are distance matrices, one for emphasis scores and one for context scores, allowing an examination of the distances among the texts being analyzed (or the characters in a play).

As currently implemented, MCCA is not a classifier. However, it was straightforward to modify it so that it can be used as a classifier. To do this, a file containing multiple “texts” is processed to serve as the reference or marker set. (In the case of the Twitter data, the file was divided into two texts, one for instances deemed to be real questions and one for instances deemed not to be real questions. See the Dent & Paul paper for a more complete description of these two sets.) Then, instances to be classified are processed one by one, with each instance first analyzed as an ordinary text, with emphasis and context scoring, and next compared to the reference texts, using the nearest distance as the criterion for the classification.
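
The nearest-distance idea can be sketched in a few lines of Python. This is not MCCA's code: the mini-dictionary is a toy (the Feeling entries are invented), plain relative frequencies stand in for MCCA's norm-adjusted emphasis scores, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter

# Toy word -> category dictionary; an assumption, not MCCA's real dictionary.
CATEGORY_OF = {
    "who": "Who-Where", "which": "Who-Where", "someone": "Who-Where",
    "forward": "Move-in-Space", "close": "Move-in-Space",
    "love": "Feeling", "hate": "Feeling",
}
CATEGORIES = sorted(set(CATEGORY_OF.values()))


def profile(text):
    """Relative frequency of each category among the categorizable words."""
    words = text.lower().split()
    counts = Counter(CATEGORY_OF[w] for w in words if w in CATEGORY_OF)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CATEGORIES]


def classify(instance, references):
    """Return the label of the reference text whose profile is nearest."""
    p = profile(instance)
    return min(
        references,
        key=lambda label: math.dist(p, profile(references[label])),
    )


# Two invented reference "texts", one per class.
refs = {
    "question": "who which someone who close",
    "not_question": "love hate love forward",
}
print(classify("who is close", refs))
print(classify("love love forward", refs))
```

Words missing from the dictionary are simply ignored here, which mirrors the “leftover” treatment only loosely.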

In this first look at the Twitter data, we simply put the two files (questions.txt and notquestions.txt) into one file, with separators to reflect the two sets. We then processed this file (in about 4 seconds). The size of this file is comparable to Hamlet (about 30,000 tokens), about 10 times as large as a smaller demonstration file of five texts. The first observation about the data is the percentage of words that could be categorized. For the demonstration file, about 90 percent of the words were classified; this is roughly what occurs for modern texts. For Hamlet, 83 percent are classified; this reflects the change in English over 400 years. For the Twitter data, only 77 percent of the words were classified; this is a clear indication that, with Twitter, a significant change in the language is occurring.

I next examined the various statistics generated by MCCA to attempt to discern any differences between the Questions and ¬Questions. There were non-zero distances between the two sets for both emphasis scores and context scores; it was not immediately clear whether these differences were important. One of the result sets is a difference analysis that shows the emphasis categories that are most different between the two sets. In this case, two categories looked most different: Move-in-Space (forward, close, side) and Who-Where (who, which, someone, something); these looked interesting.

The next step was to classify the instances. In this initial examination, I used the full Twitter data as the reference set and then classified each instance. I did not create a subset of the data to use as a “training set”, against which to classify the remaining instances as the “test set”. This would have been more rigorous, but I’m not sure that it would have been necessary in this first look. In this first test, I used the emphasis score distance as the criterion for classification. The results are shown in the following table:

MCCA Results	MT Questions	MT ¬Questions
MCCA Question	708	433
MCCA ¬Question	444	717

For comparison, the Dent-Paul results are shown in the following table:

Dent-Paul Results	MT Questions	MT ¬Questions
Parser Question	898	486
Parser ¬Question	254	666

The MCCA results have a precision of 0.62050, a recall of 0.61458, and an accuracy of 0.61902. These compare with the Dent-Paul results, with a precision of 0.64884, a recall of 0.77951, and an accuracy of 0.67881.
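
For readers who want to recompute these measures from the two tables, here is a minimal Python sketch. It assumes the MT columns are the reference labels, the rows are the system's decisions, and “Question” is the positive class.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from the four confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Cell counts copied from the two tables above.
mcca = metrics(tp=708, fp=433, fn=444, tn=717)
dent_paul = metrics(tp=898, fp=486, fn=254, tn=666)

for name, (p, r, a) in [("MCCA", mcca), ("Dent-Paul", dent_paul)]:
    print(f"{name}: precision={p:.5f} recall={r:.5f} accuracy={a:.5f}")
```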

My conclusion is that the MCCA results are quite comparable and were achieved with much less effort. I performed the same test using the context score distance, the five categories with the highest differences, and the single Who-Where category. In all these cases, the results were not as good, with only the top five categories achieving an accuracy of 0.61033, but with much lower precision. I’m not sure if any better results can be achieved with MCCA. It’s possible that use of various machine learning classifiers might optimize the results, but I don’t think this would be worth the effort on this data set, which may not be typical of Twitter data. I think this classification task is very difficult, particularly since the principal criterion for selecting tweets in the Dent & Paul study was the presence of a question mark.

One very interesting aspect of the Twitter data is the overall contextual characterization of the two sets. As indicated above, the reference file was separated into two sets: Questions and ¬Questions. One of the statistics produced by MCCA is a table of the weighted context scores. Each context is normalized on a 50 point scale, from -25 to +25. The results in this case are shown in the following table:

Text Group	Traditional	Practical	Emotional	Analytical
Average	0.70	-10.35	24.30	-14.65
Question	1.04	-9.46	23.96	-15.54
¬Question	0.42	-11.07	23.96	-13.93

In papers available through the MCCA link above, the point is made that in analyzing texts, the social context scores are almost never pure. Thus, these context scores are very surprising. They suggest that, at least for this set of Twitter data, the instances are almost purely emotional, with a strong anti-practical and anti-analytical bias, and a neutral score on traditional values. These results raise the question of whether this data is representative of the question universe and whether real questions that, for example, might be asked in a more practical and analytical context are being missed. Notwithstanding, the results make one wonder whether further examination of Twitter data in different contexts (e.g., in dealing with natural disasters or in situations like the Arab Spring) would show different profiles. Clearly, this kind of analysis might prove to be very useful and interesting.

Enhanced Word Sketches

Ken — Thu, 10 Mar 2011 19:19:20 +0000

Recently, I made a request on the ACL SIGLEX mailing list for tools that might help in analyzing preposition lexical samples. In this request, I indicated a need for software that would specifically provide enhanced word sketch analysis. I only received a couple of replies, one of which asked what I meant by this term. I responded, with some vagueness, but the interchange sparked some thoughts that are worth exploring further. In particular, this discussion raised questions about the amount of information in preposition dictionary entries and what might help in expanding these entries. I’d like to expand on this, particularly on the relation between current approaches to word-sense disambiguation (primarily statistical in nature) and what ends up in the dictionary. I think there is still something of a disconnect between the computational community and the lexicographers.

Adam Kilgarriff developed the word sketch engine, which has found great use among lexicographers. The main novelty of word sketches is the reliance on tagged corpora, in which each word receives a part of speech tag. Of particular note is that these tags constitute the terminals in a parse tree. However, the full parse tree showing constituents is not provided. Instead of a parse tree, some chunking of the terminals is performed, primarily using the Corpus Query Language. The sketches generate very useful information, including many syntactic relationships, through this bottom-up approach. To me, there are two major problems: (1) there is no constituent analysis and (2) there is no semantic analysis. Patrick Hanks has extended word sketches somewhat in what he’s doing with Corpus Pattern Analysis. The focus of this analysis is on the identification of syntagmatic patterns (for verbs only). This pattern analysis has a strong affinity to FrameNet, but crucially adds some semantic characterization to the elements of the patterns using a shallow ontology. The development of these patterns is a labor-intensive effort, involving the development of patterns and the tagging of corpus instances with a pattern number until all instances (a sample if there is a large number) are completed. Currently, there is no automatic or semi-automatic tool to facilitate this process.

On the other side of the equation, the computational community has made substantial progress in identifying features useful in word-sense disambiguation (WSD). I’ll stick here to work in preposition WSD, since its literature is smaller but sufficient to illustrate the process. The major contributions stem from the preposition disambiguation task of SemEval-2007 and a comprehensive treatment by O’Hara & Wiebe (2009). In SemEval-2007, there were three participants: Ye & Baldwin, Yuret, and Popescu et al. Hovy et al. built upon these results to make further advances in preposition disambiguation. While these efforts have identified and refined the set of features useful in preposition disambiguation, up to 85 percent accuracy, I don’t think they have fully explored the potential feature space. To consider how enhanced word sketches might contribute further, we need to examine in detail what these features are and how they correspond to what’s in a dictionary.

Collocation features (the context) are the most important in preposition disambiguation. While the earlier studies focused on context windows, Hovy et al. found that the governor (the word to which a prepositional phrase is attached) and the object are of key importance, both of which are generally found within context windows, with the governor of greater importance than the object. Dictionaries do not generally provide any information about (the class of) the governor. One exception to this may be found in definitions of the preposition of (see the online preposition project data), where several senses characterize the governor. Conversely, the definitions of many verbs and nouns will identify, explicitly or implicitly, an association with a specific preposition. For example, move (go in a specified direction) links well with a sense of to (expressing motion in the direction of (a particular location)); this is the kind of linkage (chain-clarifying relationship) investigated in Popescu et al. More preposition definitions characterize the preposition object, although some specify the semantic role of the object rather than properties of the object itself.
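
To make the governor and object features concrete, here is a minimal Python sketch of a heuristic extractor over Penn Treebank style tags. It is not the method used in any of the cited studies: the function name is invented, and taking the nearest verb or noun to the left as the governor is a crude stand-in for real attachment decisions.

```python
def governor_and_object(tagged, prep_index):
    """Guess the governor and the object head of the preposition at prep_index.

    tagged is a list of (word, Penn Treebank tag) pairs. The governor is the
    nearest verb or noun to the left; the object head is the last noun in the
    noun run that follows the preposition.
    """
    governor = None
    for word, tag in reversed(tagged[:prep_index]):
        if tag.startswith(("VB", "NN")):
            governor = word
            break

    obj = None
    for word, tag in tagged[prep_index + 1:]:
        if tag.startswith("NN"):
            obj = word  # keep updating so the last noun in the run wins
        elif obj is None and tag in {"DT", "JJ", "PRP$", "CD"}:
            continue    # skip pre-modifiers before the first noun
        else:
            break
    return governor, obj


s1 = [("They", "PRP"), ("played", "VBD"), ("against", "IN"), ("Boston", "NNP")]
s2 = [("She", "PRP"), ("resided", "VBD"), ("in", "IN"), ("the", "DT"),
      ("old", "JJ"), ("city", "NN"), ("centre", "NN"), ("yesterday", "RB")]
print(governor_and_object(s1, 2))
print(governor_and_object(s2, 2))
```

A context-window feature set would instead take every token within a fixed distance of the preposition; the two words returned here will usually fall inside such a window, which fits the finding reported above.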

In the several studies, syntactic and semantic features are determined to be of less importance. However, a significant problem with this conclusion is the issue of how well available tools characterize these features. In these studies, semantic characterizations have examined only WordNet-based features. Since WordNet makes no claims about semantic classes, this conclusion must be held in abeyance.

The Preposition Project (TPP) has characterized many properties of each preposition sense. These have not been fully investigated in the several studies. TPP labels each sense according to its Quirk syntax; Hovy et al. used “fronting” (capitalization) as a feature; such a feature could be important for some prepositions but not for others. TPP identifies FrameNet frames and frame elements associated with each sense (based on the available corpus); these constitute an additional type of semantic characterization, possibly relevant features that could be investigated. TPP also identifies other prepositions that can substitute; these were used by Yuret, who found that substitutions, while useful for disambiguation, did not work as well as for verbs and nouns. Potentially, these substitutions could be examined in conjunction with the preposition classes built from the TPP data; analysis of disambiguation properties by class has not yet been investigated. To some extent, coarse classes have been investigated through use of Penn Treebank (PTB) data; however, since prepositions were not accorded much prominence in PTB, further study may be warranted along these lines. O’Hara & Wiebe and Hovy et al. both reported good results

In summary, what I’d like an enhanced word sketch to do is to enable me to tag (like a lexicographer), build syntagmatic patterns for each sense, and ultimately to be able to assign (I think) definitive frame elements to each sense. I’d want to put this into a dictionary in such a way that we could have something like a decision tree to identify the appropriate sense. I’m hoping that such a decision tree would greatly facilitate building an appropriate representation in which the PPs are placed in their proper subsidiary roles.

Semantic Primitives

Ken — Sat, 15 Jan 2011 22:30:12 +0000

In a recent posting to CORPORA on the topic of semantic primitives, John Sowa says,

The so-called primitives are the result of analysis by adults who have learned how to write dissertations about language. I believe there are no primitives that are truly primitive in the sense that they cannot be analyzed in different ways by different adults with different biases.

While I won’t argue with John, I do believe such statements can have a discouraging effect on useful research. Throughout the 1970s and 1980s, research on machine-readable dictionaries (MRDs) was quite the rage. However, in 1991, Jean Veronis and Nancy Ide wrote a paper, “An Assessment of Semantic Information Automatically Extracted from Machine Readable Dictionaries.” They concluded that 55 to 70 percent of the data was garbled in some way. This paper had a similar discouraging effect on MRD research. I have been engaged in MRD research for 40 years and would like to suggest that the search for primitives is not without value.

In general, I agree with John’s assessment of research on primitives. However, I think the one major reason for this is that such work tends to be very a priori, i.e., the result of people thinking about what should constitute a set of primitives. I agree that these various efforts are not very convincing. I have been critical of 1000 words of Basic English and the use of a defining vocabulary of 2000 words for the Longman’s dictionaries. I have been skeptical about Wierzbicka’s primitives and the WordNet tops because they were not derived from evidence. I am very skeptical about the root nodes in ontologies (all the rage these days).

In late 2001 and early 2002, I was fortunate to do some work for the Oxford Dictionary of English (ODE) in identifying what were called superordinates of noun definitions. At the time, Oxford had developed a noun hierarchy that was based in part on WordNet. I developed routines to parse these definitions and to generate these superordinates. I estimated that my work was about 85 percent correct and I was allowed to proceed. My exercise was a starting point for the Oxford lexicographers, who then followed in behind me to correct my work and to use their expertise to fill in what I had not been able to accomplish. I believe the quality of this exercise is now visible in their new online dictionary (which includes large sets of sentences from the Oxford English Corpus). This is certainly not the ultimate in identification of primitives, but it has the significant advantage that it is data-driven, i.e., a posteriori.

In addition to this exercise, I have made frequent use of digraph analysis of sets of definitions in an attempt to find primitives. I have done this for prepositions (preposition classes in the preposition project), FrameNet’s frame elements, and 13,000 verbs in the Macquarie dictionary. Certainly not a be-all and end-all, but helpful in attempting to make sense of our wonderful language.

In some recent conversations on noun compounds with Robert Amsler (whose verb taxonomy derived from Merriam-Webster’s pocket dictionary set the tone for MRD research at the end of the 1970s), he has talked about how he views the potential of the latest Google corpus for his efforts. Specifically,

The simple rule that we accept in chemistry, that a compound may not exhibit any of the properties of its component elements, doesn’t seem to be understood as essentially true in the lexicon as well.

I think the important point here is that, hopefully in some scientific manner, we can proceed to analyze these sorts of phenomena. I hope we don’t take John’s words to stifle such research.

Semantic Analysis

Ken — Sat, 04 Dec 2010 21:02:45 +0000

I recently developed an overview of the tasks in SemEval (the series of semantic evaluations conducted under the auspices of the ACL SIGLEX). The nice thing about this exercise was that it put semantic analysis into a larger perspective, where it becomes clearer where things are lacking. The overview groups the tasks into dictionary issues and issues involving how sentence and textual elements fit together, the fruits of which are then available for application areas. After the first Senseval (the precursor to SemEval) was conducted, with a focus on word-sense disambiguation (WSD), the question was raised as to what purpose WSD served. The same question can be asked about all the other tasks. Attempting to answer this question may help to identify needed further tasks in SemEval, but also may help to identify how the various pieces of information may be used in different application areas. In what follows, I offer some opinions, particularly trying to identify other research that is relevant to the SemEval tasks.

In the area of dictionary issues, there are several holes. The main hole is that dictionary entries still do not contain the necessary information to enable disambiguation among multiple senses. Over the past 20 years, corpus linguistics has become the sine qua non among lexicographers. This has revolutionized the construction of dictionary entries, primarily due to the current reliance on corpus evidence, rather than made-up examples. With this use of corpus evidence, entries also increasingly attempt to characterize the constructions in which the entries appear (see The Oxford Guide to Practical Lexicography, Atkins & Rundell, 2008). However, these characterizations seem only to contain syntactic information and little semantic information. There are some selectional restrictions or selectional preferences, but these are only minimal. The work of Patrick Hanks, in corpus pattern analysis, attempts to rectify this shortcoming, and indicates how much work needs to be done in this area. Corpus evidence is also being used increasingly to characterize the collocational patterns of entries (see particularly DANTE). Now, in saying that this is insufficient to enable disambiguation, the proof of the pudding is that these characterizations do not yet fully incorporate the features that have been found useful in various WSD systems, particularly supervised systems, where investigators have tried a considerable panoply of textual attributes. The work of these researchers has not yet been incorporated into lexical databases. A second major hole in dictionaries is that entries do not contain a representation of the content of a sense that can be used to build a larger representation of a text. This hole leads into the second area in which SemEval tasks are focused.

Many SemEval tasks attempt to characterize how sentence and textual elements fit together, once some initial syntactic processing has been completed. In these tasks, an attempt is made to characterize chunks of text in a sentence and beyond (e.g., in “full texts”). These tasks include such things as semantic role labeling, semantic relation analysis, and coreference resolution. Now, what we’d really like here is a contribution from the lexicon, i.e., a representation that can be plugged into these analyses. There has been some beginning of this task with frame semantics, via the FrameNet project, where many lexical units trigger a frame consisting of frame elements, and where the frame definition and frame elements may be viewed as definitional in form. With its foray into full-text analysis, there is some beginning of intersentential relations. Another useful formalism is the lexicon development environment for use with unification-based linguistic formalisms, e.g., the LKB system, which incorporates lexical items in HPSG systems. Rhetorical Structure Theory provides another way of examining a text in its totality, but this theory has not been developed much of late. Importantly, as John Sowa points out in The Role of Logic and Ontology in Language and Reasoning,

Forty years of research in logic, linguistics, and AI has not produced a successful implementation: no computer system based on that approach can read one page of a high-school textbook and use the results to answer the questions and solve the problems as well as a B student.

Semantic analysis plays a large role in what may be considered its ultimate application areas, such as information extraction, question answering, document summarization, machine translation, paraphrasing, and recognizing textual entailment (RTE). The contributions of semantic analysis are difficult to assess in these tasks. Each has developed its own methods and there doesn’t seem to be any overarching analysis that identifies the specific contribution of semantic components. In many of these areas, investigators have begun to perform ablation analyses that seek to identify the relative contributions of their components. In RTE, the situation has become somewhat dire, where investigators do not have a clear idea of how results are being achieved. Sammons et al. (2010), in “Ask Not What Textual Entailment Can Do for You”, have proposed a community-wide effort to annotate RTE examples with the inference steps required to reach a decision about the example. This indicates the scale of the effort.

Lexicologic Insights from Cognitive Neuroscience

Ken — Sat, 20 Feb 2010 21:17:56 +0000

The Number Sense: How the Mind Creates Mathematics (1999) and Reading in the Brain (2009), by Stanislas Dehaene, provide insights that can aid in the construction of computational lexicons. Dehaene describes how both reading and mathematics recruit structures of the brain that evolved for other purposes (the neuronal recycling hypothesis). There is a visual recognition process that progressively extracts graphemes, syllables, prefixes, suffixes, word roots, and numbers. After this process, two routes in parallel activate speech creation and look-up in a mental lexicon. For both reading and mathematics, the processes are different from the computational processes implemented in computers (e.g., mathematical algorithms and parsing). Rather than attempting to optimize computational mechanisms for such processes, we can take a slightly different route by following the steps used by the brain to perform these tasks, i.e., accessing fragments of meaning in the mental lexicon.

Dehaene suggests that the number sense can be viewed as a sense just like the sense of smell or taste and can be found in many other animals besides humans. Although the number sense gave rise to mathematics, it is crucially different from the rigor and logic of mathematics; that is, the mind-computer metaphor does not hold up. If we look at how mathematics is performed, from a cognitive neuroscientific perspective, we can observe a combination of serial and parallel processing. This processing does not mirror the kind of processing used in looking up a word in an alphabetic dictionary and obtaining a definition. If we take this alternative route, we will not follow the logic of such things as ontologies, which have the same fatal flaws as the logical incompleteness of mathematics.

Dehaene describes the use of electroencephalograms (EEGs) to discern the order in which brain regions are activated. He particularly examines mathematical tasks, such as comparing two numbers to see which is larger. The main steps are planning, sequential ordering, decision making, and error correction; these steps are under the control of “executive areas” of the brain, which call the necessary modules into play.

Dehaene’s methods provide the following general scenario: Some task is to be performed and different modules in the brain are activated in some order. Ten or twenty areas in the brain are activated in tasks such as reading words, examining their meaning, viewing a scene, or performing a calculation. Each region performs an elementary operation such as constructing a pronunciation or identifying the part of speech. The modules are generally very specific and very fine-grained. Our task is to identify and characterize the distinct fragments of meaning. Initially, we do not know exactly what fragments there are and we run the risk of imposing preconceptions onto the process.

Dehaene suggests that reading first involves a “letterbox” which performs some recognition of a word (graphemes, syllables, prefixes, suffixes, word roots, and numbers). In examining the processing of strings, Dehaene identified a top-down sequence in assessing the meaning of a word. He used the strings EIGHTEEN, EINSTEIN, EXECUTE, and EKLPSGQI. Initially, visual areas were activated, and then, by a quarter-second, actual words were discriminated from meaningless words. At this point, activations differed for number words, proper nouns, or verbs and other words. There was a difference in response for major categories, i.e., access to the meaning of a word. In The Number Sense, Dehaene in 1997 summarizes these findings to indicate that the full characterization of all these nuances was only beginning; he noted that the number of questions could go on and on.

After presenting the details of observations, Dehaene goes on to consider the nature of mathematics in light of these discoveries. He concludes first that the brain-computer metaphor is not a good model of the data: the brain is not a logical machine. Dehaene goes on to examine axiomatic systems in mathematics (e.g., attempts by logicians such as Peano, Frege, and Russell to build a consistent basis for mathematics), leading up to Gödel’s Incompleteness Theorem. He concludes that mathematics has been subject to evolution, with increasing efficiency in its ability to express mathematical ideas. He suggests that this results from mathematical intuitions (e.g., Chinese words for numbers are much shorter to pronounce than English numbers, leading to greater efficiency in making simple calculations). The point of all this is that it is very easy to get hung up on and locked into mathematical formalisms, which may become problematic because of some ultimate inconsistency.

Mathematicians, particularly around the early 1900s, were very keen on building a logical structure for mathematics. Most notable was Bertrand Russell’s Principia Mathematica. These efforts were derailed by Gödel’s incompleteness theorems, which carried over to Turing machines and the advent of the digital computer. Lessons from these efforts should carry over to those who are attempting to develop ontologies. They will always be incomplete. In addition, the efforts to develop ontologies may obscure important aspects of our attempts to build dictionaries. They are focused too much on hierarchical representations (i.e., following the hypernymic backbone) and do not take into account all the many activations that may occur when we are confronted with bringing to bear knowledge about a word. (See the Suggested Upper Merged Ontology, SUMO, the Cyc ontology, WordNet, and the Semantic Web.)

The problem is that there are too many pieces of information associated with a word: all the context that needs to be brought to bear (i.e., corpus linguistics), syntactic knowledge, semantic knowledge, relations with other parts of the lexicon, culture, etc. In the parser I use, the lexicon is designed for rapid access to syntactic information. It uses a hashing technique to access a word (i.e., it does not proceed by alphabetic lookup) and stores the word’s information in lists. These lists are nested, with syntactic categories at the first level and possibly other information as sublists providing limited amounts of context, subcategorization patterns, or various irregularities.
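As a rough illustration of the layout just described, the following Python sketch stores each word's information as nested lists reached through a hash table. It is not the parser's actual code: the sample entries, the category labels, and the function names `categories` and `sublists` are invented for the example, and Python's built-in dict stands in for the hashing technique.

```python
# Hypothetical lexicon: a hash table (Python dict) from word to nested lists.
# First level: one list per syntactic category.
# Sublists: optional extra information such as subcategorization patterns
# or irregular forms.
LEXICON = {
    "run": [
        ["verb",
         ["subcat", "intransitive", "transitive"],
         ["irregular", "ran", "run"]],
        ["noun"],
    ],
    "of": [
        ["preposition"],
    ],
}


def categories(word):
    """Return the syntactic categories at the first level of an entry."""
    entry = LEXICON.get(word, [])  # hashed access, no alphabetic search
    return [category_list[0] for category_list in entry]


def sublists(word, category):
    """Return the sublists of extra information stored under one category."""
    for category_list in LEXICON.get(word, []):
        if category_list[0] == category:
            return category_list[1:]
    return []
```

Calling `categories("run")` walks only the first level of the entry, while `sublists("run", "verb")` descends one level further, which mirrors the idea that syntactic categories are the cheapest information to reach.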

There is a growing field of computational neurolinguistics (see upcoming workshop), as well as attention being paid to optimal organization of the lexicon (another upcoming workshop). At the moment, it seems that cognitive neuroscience is focused primarily on comparing models of the neural activity with various language resources. Important studies in this area include Mitchell et al. (2008) and Murphy et al. (2009). The latter study particularly makes use of EEG data, but it was designed primarily to determine whether a priori semantic features were correlated with activation of particular brain regions. There are many fragments of meaning associated with individual words, so this kind of study is only a first step.

As I continue to investigate developments in these areas, I will be attempting to identify mechanisms that can be used in the design of computational lexicons. As an example of this, consider the first step of recognizing words, the hashing step I mentioned above. In my parser, the look-up phase computes a hash value for each word and accesses the location of its definition in the dictionary. The first step is to create an intermediate memory of all the parts of speech associated with the word, including the possibility that the string is merely a meaningless string and not a word at all. An important question is whether this is the most efficient access. It is this kind of question that will be informed by findings in cognitive neuroscience. I will draw upon these findings in later posts.
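The look-up step described above might be sketched as follows. This is only an illustration under assumptions: the tiny lexicon, the `NOT_A_WORD` marker, and the function `lookup_candidates` are hypothetical, and a Python dict plays the part of the hash table.

```python
NOT_A_WORD = "not-a-word"  # hypothetical marker for a meaningless string

# Hypothetical lexicon: word -> parts of speech recorded in the dictionary.
PARTS_OF_SPEECH = {
    "eighteen": ["number"],
    "execute": ["verb"],
    "over": ["preposition", "adverb"],
}


def lookup_candidates(token):
    """Build the intermediate memory of candidate analyses for one token."""
    # Dict access hashes the key; no alphabetic search is involved.
    found = PARTS_OF_SPEECH.get(token.lower())
    if found is None:
        # Nothing recorded: the remaining hypothesis is a meaningless string.
        return [NOT_A_WORD]
    # Copy the list so later processing cannot alter the lexicon itself.
    return list(found)
```

With this sketch, `lookup_candidates("EXECUTE")` yields the recorded parts of speech, while a string such as `"EKLPSGQI"` yields only the non-word marker.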

Taxonomy Change Operations

Ken — Fri, 29 Jan 2010 16:28:40 +0000

I have been involved in the development of a frame element hierarchy or taxonomy, based on FrameNet’s frame-to-frame relations and frame element definitions. Since I know that this taxonomy is not perfect and can be improved, I need to consider the types of operations that might be involved in making changes. Although this may seem a trivial task, a substantial amount of rigor needs to be maintained. Many other systems (particularly ontologies) also involve some sort of hierarchical relationships, principally the ISA relationship. The operations I consider will embrace these as well.

A taxonomy is a system of classification. (See the definitions returned by Google.) An ontology is very similar, “a rigorous and exhaustive organization of some knowledge domain that is usually hierarchical”. A taxonomy contains a root, the top level node under which all the other concepts are organized. A taxonomy is a tree, with no cycles, so that when the full taxonomy is given, a strict hierarchy is produced, with the bottommost nodes called leaves.

Given a taxonomy, the following types of changes are envisioned:

adding a node: addition of a node may occur either as a new leaf or as an internal node
deleting a node: removing a node from the taxonomy, again, either a leaf node or an internal node
merging nodes: aggregating two or more nodes, possibly leading to the deletion of a node
moving a subtree: changing the hypernym for a node, so that the node and all its children are moved in the taxonomy to another location
splitting a node: creating subsets of the definitions of a given node, renaming the new subsets, and positioning the subsets at an appropriate place in the taxonomy
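To make the first four operations concrete, here is a minimal Python sketch of a taxonomy stored as a child-to-hypernym map. The class and method names are invented for illustration and are not taken from FrameNet or from any existing tool. Policy decisions are reduced to a single option each (for example, the children of a deleted node are simply reattached to its hypernym), and splitting is omitted because it depends on examining the definitions attached to a node.

```python
class Taxonomy:
    """A tree stored as a child -> hypernym map (illustrative sketch only)."""

    def __init__(self, root):
        self.root = root
        self.parent = {root: None}

    def children(self, node):
        return [n for n, p in self.parent.items() if p == node]

    def ancestors(self, node):
        """Hypernym chain from the node's parent up to the root."""
        chain = []
        node = self.parent[node]
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain

    def add_node(self, node, parent, adopt=()):
        """Add a leaf, or an internal node adopting some of parent's children."""
        if node in self.parent:
            raise ValueError(f"{node!r} is already in the taxonomy")
        if parent not in self.parent:
            raise KeyError(parent)
        if any(self.parent.get(child) != parent for child in adopt):
            raise ValueError("adopted nodes must be children of the parent")
        self.parent[node] = parent
        for child in adopt:
            self.parent[child] = node

    def move_subtree(self, node, new_parent):
        """Change a node's hypernym; its descendants move with it."""
        if node == self.root:
            raise ValueError("the root cannot be moved")
        if new_parent == node or node in self.ancestors(new_parent):
            raise ValueError("move would create a cycle")
        self.parent[node] = new_parent

    def delete_node(self, node):
        """Delete a node; any children are reattached to its hypernym."""
        if node == self.root:
            raise ValueError("the root cannot be deleted")
        for child in self.children(node):
            self.parent[child] = self.parent[node]
        del self.parent[node]

    def merge_nodes(self, source, target):
        """Fold source into target, keeping target's name."""
        if source == self.root:
            raise ValueError("the root cannot be merged away")
        if source == target or source in self.ancestors(target):
            raise ValueError("target must not be source or lie below it")
        for child in self.children(source):
            self.parent[child] = target
        del self.parent[source]
```

The cycle checks in `move_subtree` and `merge_nodes` are what keep the structure a tree; without them, a careless move could attach a node beneath one of its own descendants.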

In adding a node as a leaf, there is usually no difficulty as long as we retain the principles under which the taxonomy is being maintained. When we add an internal node, at some intermediate level of the taxonomy, we will need to consider whether we are adhering to these principles for what would be the children of the new internal node. (For the frame element taxonomy, the addition of nodes will be performed when there is a change in FrameNet.)

Deleting a leaf node should also be straightforward, since we will not be affecting any other nodes in the taxonomy. Deleting an internal node, however, will have repercussions for its child nodes. A decision will have to be made on what to do with these, either deleting them as well or moving them to other places in the taxonomy. (For the frame element taxonomy, the deletion of nodes will be performed when there is a change in FrameNet.)

Merging nodes first involves assessing the target node, i.e., determining whether to keep the name or create a new name. Second, we have to determine what happens to all the children of each of the nodes being merged. In creating the frame element taxonomy, nodes were merged when there was some problem with the creation of the digraph image (e.g., a slash in the frame element name) or the node names differed only in case. (For the frame element taxonomy, the merging of nodes will be performed when there is a compelling reason to do so; this reason will need to be stated explicitly.)

Moving a node, and possibly moving a subtree, involves changing the hypernym of the node. When doing so, it will be necessary to keep in mind the principles underlying the construction of the taxonomy and to make sure that the children of the moved node will continue to adhere to these principles. (For the frame element taxonomy, the moving of nodes will be performed when there is a compelling reason to do so; this reason will need to be stated explicitly.)

Splitting a node is perhaps the most interesting operation in changing a taxonomy. First, it will be necessary to determine how to name the new nodes. Second, it will be necessary to identify how the children will be affected. In all likelihood, the children will be split into subsets, with some children going to each of the new nodes. (For the frame element taxonomy, the splitting of nodes will generally be based on an examination of the frame element definitions. It will generally be clear that a node being split has more than one sense. This operation is analogous to the splitting of a word’s meaning in a dictionary into subsenses.)
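The two decisions involved in a split, naming the new nodes and distributing the children, can be sketched in Python as below. The function name, the dict-based representation (each node mapped to its hypernym), and the sample node names are assumptions made for illustration. The sketch also assumes the simplest placement, in which every new node takes over the hypernym of the node being split; in practice a new node might belong elsewhere in the taxonomy.

```python
def split_node(parent, node, assignment):
    """Split `node` into new nodes and distribute its children among them.

    parent:     dict mapping each node to its hypernym (the root maps to None)
    node:       the node to split
    assignment: dict mapping each new node name to the children it receives
    """
    if parent[node] is None:
        raise ValueError("splitting the root would leave several roots")
    old_children = [n for n, p in parent.items() if p == node]
    assigned = [child for kids in assignment.values() for child in kids]
    if sorted(assigned) != sorted(old_children):
        raise ValueError("every child must go to exactly one new node")
    for new_name in assignment:
        if new_name in parent and new_name != node:
            raise ValueError(f"{new_name!r} is already in the taxonomy")
    # All checks passed; only now is the taxonomy modified.
    hypernym = parent.pop(node)
    for new_name, kids in assignment.items():
        parent[new_name] = hypernym
        for child in kids:
            parent[child] = new_name
    return parent
```

For example, with `{"Top": None, "Place": "Top", "Source": "Place", "Goal": "Place"}`, splitting `"Place"` with the assignment `{"Place_from": ["Source"], "Place_to": ["Goal"]}` replaces the one node by two siblings under `"Top"`, each keeping one of the original children.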

In making any changes to a taxonomy, it is important to keep a change log. In this way, the full explication of the taxonomy’s construction will be readily available for any further changes. (For the frame element taxonomy, a list of such changes may help the FrameNet lexicographers make modifications to the FrameNet data to ensure consistency.)
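One possible shape for such a change log is sketched below in Python. The record fields and the class names `Change` and `ChangeLog` are hypothetical; the only requirement taken from the discussion above is that each change carries an explicitly stated reason.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Change:
    operation: str  # "add", "delete", "merge", "move", or "split"
    nodes: tuple    # names of the nodes involved in the change
    reason: str     # explicit justification for the change
    when: date = field(default_factory=date.today)


class ChangeLog:
    def __init__(self):
        self.entries = []

    def record(self, operation, nodes, reason):
        """Append one change; an empty reason is rejected."""
        if not reason.strip():
            raise ValueError("a change must state its reason explicitly")
        entry = Change(operation, tuple(nodes), reason)
        self.entries.append(entry)
        return entry

    def history(self, node):
        """Return every recorded change that involved the given node."""
        return [entry for entry in self.entries if node in entry.nodes]
```

Keeping the log separate from the taxonomy itself means the history survives even after a node has been deleted or merged away.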

Preposition Classes: General

Ken — Thu, 21 Jan 2010 17:44:30 +0000

In The Preposition Project (TPP), each sense was assigned a semantic relation type by the lexicographer. These types were grouped together into 20 larger classes. The assignment of these two labels was a local decision, that is, without any a priori theoretical perspective. Once completed, the overall collection of these classifications is amenable to more detailed analysis. In particular, each class can be subjected to a digraph analysis and examined in relation to the other classes. In addition, the classes can be compared to the frame element hierarchy. The digraph analysis suggests that several of the classes are really subtypes of other classes. Examination of the frame element hierarchy assists in a clearer perception of the semantic roles filled by prepositional phrases.

The 20 classes of prepositions are: Activity, Agent, Backdrop, Barrier, Cause, Doubles, Exception, Means/Medium, Membership, Party, Possession, Quantity, Scalar, Spatial, Substance, Tandem, Target, Temporal, Topic, and Void. The 12 primitives in the frame element hierarchy are: Cause, State, Degree, Entity, Role, Purpose, Instrument, Phenomenon, Time, Path, Reason, and Topic. In the draft paper, Analysis of Preposition Classes, I provide a detailed examination of each class, providing a link to the digraph for the class and identifying the FrameNet frames and frame elements evoked by the class.

Preposition Disambiguation: State of the Art

Ken — Fri, 30 Oct 2009 16:17:56 +0000

Efforts to disambiguate prepositions have been increasing in the last few years, with claims of precision reaching 0.80. All such efforts present results in statistical generalities, with identification of the key factors related to the results. Continued progress in these efforts requires a close examination of limitations that have been noted. In addition, the exploitation of these results requires a close examination of the factors associated with each sense, so that the relevant information for each can be encoded in a meaningful way. This post summarizes the current literature on preposition disambiguation as a prelude to further developments of the data to be encoded in The Preposition Project (see sidebar link).

Meaningful preposition disambiguation requires a reasonably well-drawn sense inventory and a set of corpus instances that have been disambiguated by hand. The Preposition Project (TPP) provides these. They were used in SemEval 2007 and have since been used in other studies; such studies constitute the basic set of references. Another important study, initiated prior to TPP, is O’Hara and Wiebe (2009).

The basic SemEval task of disambiguating prepositions is described in Litkowski & Hargraves (2007). There were three participants in this task, whose results are described in Ye & Baldwin (2007), Yuret (2007), and Popescu, Tonelli, & Pianta (2007). Two later studies have also used the SemEval datasets or TPP data: Tratz & Hovy (2009) and Dahlmeier, Ng, and Schultz (2009).

Each of these studies makes use of statistical techniques (decision trees, maximum entropy, likelihood, and chain-clarifying relationships) to identify significant features associated with preposition disambiguation. These features are not identical, in some part as a result of analyzing different pieces of information. One major difficulty is the lack of any consensus on what is used to classify noun and verb types.

Ye & Baldwin used maximum entropy with three types of features: collocational features (open class words, WordNet synsets, named entities, surrounding words, and surrounding supersenses), syntactic features (parts of speech, chunk tags and types, and parse tree features), and semantic role features (semantic role tags, attached verbs, and verb relative positions). They found that collocational features played the most significant role. Tratz & Hovy also used maximum entropy, but focused on syntactic structures for identifying words of interest. They identified the verb/noun dominating the prepositional phrase, the noun/verb object of the preposition, the subject of the dominating verb, neighboring prepositional phrases, and words within 2 positions of the target. For each word so identified, they then constructed feature sets consisting of the word itself, the lemma, part of speech, synset members, hypernyms, and capitalization. They achieved an 8 percent improvement in disambiguation and concluded that words bearing some syntactic relation to the target preposition were responsible for the improvement.
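A feature extractor in the spirit of the window-based part of this approach might look like the sketch below. It is not Tratz & Hovy's implementation: it covers only the words within two positions of the target, uses the lowercased token as a crude stand-in for the lemma, and omits parts of speech, synsets, and hypernyms, which would require a tagger and WordNet. The function name and the feature-string format are invented for the example.

```python
def window_features(tokens, target_index, width=2):
    """Collect simple features for words near the target preposition."""
    features = []
    for offset in range(-width, width + 1):
        position = target_index + offset
        # Skip the preposition itself and positions outside the sentence.
        if offset == 0 or not 0 <= position < len(tokens):
            continue
        word = tokens[position]
        features.append(f"w[{offset:+d}]={word}")               # the word itself
        features.append(f"lower[{offset:+d}]={word.lower()}")   # lemma stand-in
        features.append(f"cap[{offset:+d}]={word[:1].isupper()}")  # capitalization
    return features
```

For the tokens of "She drove from Boston to Chicago" with the target at index 2, the result includes `w[+1]=Boston` and `cap[+1]=True`; such strings could then be passed to a maximum entropy or other classifier as binary features.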

Yuret examines the context of the target preposition using a statistical language model. His method is applied more generally to content words, where he looks at the general task of word-sense disambiguation using possible substitutes as a way of selecting an applicable sense. The method depends on a rich set of substitutes, which is not the case for prepositions. He makes the point that good quality substitutes for prepositions are unlikely, since they play a unique role in language. Notwithstanding, his results are sufficiently above the baseline and are supportive of the Ye & Baldwin conclusions that collocational features are important. In addition, his results suggest that the TPP data for “other prepositions” associated with each sense might allow corpus instances for the different prepositions to be studied together.

Popescu et al. also examine the context of the preposition, but with the hypothesis that the collocational features constitute a mutual disambiguation process (chain-clarifying relationships). Their performance in SemEval was limited by the fact that the context words for the prepositions were not themselves disambiguated. Their method uses a learning algorithm based on a supervised assignment of training instances (known as Angluin’s algorithm), but was limited to more superficial identification of the surrounding features.