Fulfilling the Firthian Maxim

J. R. Firth’s famous quotation, “You shall know a word by the company it keeps,” is cited as the beginning of corpus linguistics, the study of language as expressed in samples. This approach had great success in the growth of English lexicography. In 1990, the advent of computerized samples (corpora) brought about the emergence of a statistical approach to word behavior in computational linguistics, with the paper by Church & Hanks on word association norms and mutual information. As corpora have grown, so too has their analysis, particularly with word sketches, which provide a corpus-derived summary of a word’s grammatical and collocational behavior. Statistical characterizations of a word’s behavior have found many uses, but here we want to focus on their use in lexicography. Word sketches have been used by lexicographers in developing definitions for dictionaries. Increasingly, lexicographers also keep a record of the sentences they use as the basis for each definition, i.e., the company that the word keeps. Such sentences can be viewed as sense-disambiguated, at least with respect to the sense inventory that has been developed. With many sense inventories and their respective corpus instances, there is an opportunity for testing the consistency with which humans have classified the instances. Such consistency checking can be done both internally and across different resources. The emphasis of the consistency checking is on the “You” in Firth’s maxim. We explore how this can be accomplished.

A significant open problem in computational linguistics is word-sense disambiguation (WSD). As currently formulated, this task is the problem of assigning senses to a corpus of instances, given a sense inventory. Difficulties noted in the literature include differences among dictionary sense inventories, the granularity of the senses, and disagreements among annotators in making sense assignments. We want to address these difficulties more systematically by turning the WSD task around. Specifically, we want to examine the corpus instance classifications to determine the extent to which the lexicographers have been consistent in characterizing the company a word keeps.

To perform this task, we use the following steps:

  • Extracting the instances from a resource’s data for each sense, keeping track of whatever properties the resource uses to characterize a sense and collecting all the examples that have been associated with the sense,
  • Tagging the instances to obtain a set of part-of-speech tags and the lemmas for each token, doing so with a single tagger so that results across resources can be compared,
  • Analyzing the tags to locate the target word and to identify other phrases that stand in particular syntactic structure to the target (and identifying the heads of such phrases), and
  • Comparing two resources to determine the extent to which the profiles of the analyses for their senses correspond to each other, over all senses.
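The third step can be sketched in a few lines of Python. This is only an illustrative heuristic, not the implementation used for the analysis: it assumes each instance has already been tagged into (token, lemma, tag) triples with Penn Treebank-style tags, and the function name `find_object_head` and the three-token look-back window are hypothetical choices made for the example.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, str]  # (surface form, lemma, Penn Treebank-style tag)

# Tags that may sit between the verb and the head noun of its object (assumed set).
SKIPPABLE = ("DT", "JJ", "PRP$", "CD")


def find_object_head(tokens: List[Token], target: str = "abandon") -> Optional[str]:
    """Return the lemma of the head noun of the target verb's object, or None."""
    # Locate the first verbal occurrence of the target lemma.
    for i, (_, lemma, tag) in enumerate(tokens):
        if lemma == target and tag.startswith("VB"):
            break
    else:
        return None

    # Passive voice: a form of "be" within three tokens to the left of a past
    # participle. The surface subject is treated as the object, approximated
    # here by the nearest noun to the left of the "be" form.
    if tag == "VBN":
        window = range(i - 1, max(i - 4, -1), -1)
        be_idx = next((j for j in window if tokens[j][1] == "be"), None)
        if be_idx is not None:
            for j in range(be_idx - 1, -1, -1):
                if tokens[j][2].startswith("NN"):
                    return tokens[j][1]
            return None

    # Active voice, or a participle modifying a noun: skip determiners and
    # modifiers to the right, then take the last noun of the first noun run.
    head = None
    for _, lemma, tag in tokens[i + 1:]:
        if tag.startswith("NN"):
            head = lemma
        elif head is None and tag.startswith(SKIPPABLE):
            continue
        else:
            break
    return head


active = [("They", "they", "PRP"), ("abandoned", "abandon", "VBD"),
          ("the", "the", "DT"), ("research", "research", "NN"),
          ("project", "project", "NN")]
passive = [("The", "the", "DT"), ("project", "project", "NN"),
           ("was", "be", "VBD"), ("abandoned", "abandon", "VBN")]
participle = [("an", "a", "DT"), ("abandoned", "abandon", "VBN"),
              ("car", "car", "NN")]

for instance in (active, passive, participle):
    print(find_object_head(instance))
```

A heuristic this crude will miss many objects, for example when a prepositional phrase separates the subject from a passive verb or when the tagger labels a participle as an adjective, so it should be read as a starting point rather than a parser.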

The last two steps form the essence of satisfying the Firthian maxim, characterizing the company a word keeps. Clearly, both steps can become quite involved. We have only scratched the surface in developing appropriate methods, but we hope they at least provide a proof of concept that this approach will be useful.

Our analysis has involved several prominent lexical resources: the Oxford Dictionary of English (ODE) sentence dictionary, the Pattern Dictionary of English Verbs (PDEV), the Dictionary of Analysed Texts of English (DANTE), FrameNet (FN), and WordNet (using SemCor). We have only examined one word, abandon, in its verb senses, and have thus far only used one criterion, its object, as the basis for the tag analysis and the resource comparison. These resources have the following properties:

  • ODE: 7 senses, 118 sentences
  • PDEV: 7 senses, 228 sentences
  • DANTE: 9 senses, 50 sentences
  • FN: 3 senses, 20 sentences
  • WN: 5 senses, 19 sentences

Each sense in each resource is identified as having a noun phrase object. However, the object does not always immediately follow the verb, since the verb in the corpus instances is frequently in the passive voice (where the surface subject is actually the object of the verb) or used as a past participle modifying a noun (taken to be the object). The tag analysis attempts to find these objects, and our preliminary implementation succeeded in about half the cases. We used the lemma corresponding to the head of the noun phrase as the basis for comparing the five resources. The comparison looked at two resources at a time, arraying the senses of one against the senses of the other and counting the number of heads in common for each cell in the matrix.
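The pairwise comparison can be illustrated with a minimal sketch. It assumes each resource has been reduced to a mapping from sense identifier to the object-head lemmas found in that sense's instances; the function name `common_head_matrix`, the sense labels, and the toy data are all invented for the example and are not taken from any of the five resources.

```python
from itertools import product
from typing import Dict, Iterable, Tuple


def common_head_matrix(
    res_a: Dict[str, Iterable[str]],
    res_b: Dict[str, Iterable[str]],
) -> Dict[Tuple[str, str], int]:
    """Count the object heads shared by every pair of senses (one from each resource)."""
    return {
        (sense_a, sense_b): len(set(res_a[sense_a]) & set(res_b[sense_b]))
        for sense_a, sense_b in product(res_a, res_b)
    }


# Invented toy data: sense id -> object-head lemmas.
resource_a = {
    "A1": ["ship", "car", "village"],
    "A2": ["plan", "project", "attempt"],
}
resource_b = {
    "B1": ["car", "ship"],
    "B2": ["plan"],
    "B3": ["project", "hope"],
}

matrix = common_head_matrix(resource_a, resource_b)

# Print the matrix with resource A's senses as rows and resource B's as columns.
print("    " + "  ".join(resource_b))
for sense_a in resource_a:
    row = "   ".join(str(matrix[(sense_a, sense_b)]) for sense_b in resource_b)
    print(f"{sense_a}   {row}")
```

In this toy matrix, sense A1 shares heads with only one sense of the other resource, while A2 shares heads with both B2 and B3. A row with more than one nonzero cell is the pattern to look for when one inventory draws finer distinctions than the other.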

Our analysis yielded several observations:

  • Lexicographically well-drawn sense inventories (ODE, PDEV, and DANTE) were generally consistent with each other, with common heads intersecting in a single sense.
  • When sense inventories differing in size were mapped (e.g., PDEV and DANTE), common heads intersected in more than one sense, suggesting that the inventory with the larger number of senses had “split” senses that had been “lumped” by the other inventory (i.e., addressing the issue of sense granularity).
  • Some mappings revealed inconsistencies. For example, in FN, corpus instances including “abandon a project” appeared under multiple senses, suggesting a violation of the Firthian maxim.
  • Some tag analyses showed internal inconsistency within a single sense inventory, e.g., “abandon a plan” appeared under two senses in PDEV.
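An internal consistency check of the kind behind the last two observations could look like the following sketch. It again assumes a mapping from sense identifier to object-head lemmas; the function name `heads_in_multiple_senses`, the sense labels, and the data are invented for illustration and do not reproduce any resource's actual entries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def heads_in_multiple_senses(resource: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each object head found under more than one sense to those senses."""
    senses_by_head = defaultdict(set)
    for sense, heads in resource.items():
        for head in heads:
            senses_by_head[head].add(sense)
    return {
        head: sorted(senses)
        for head, senses in senses_by_head.items()
        if len(senses) > 1
    }


# Invented toy inventory: "plan" has been filed under two different senses.
inventory = {
    "S1": ["ship", "car"],
    "S2": ["plan", "project"],
    "S3": ["plan", "hope"],
}

conflicts = heads_in_multiple_senses(inventory)
print(conflicts)
```

A head flagged this way is only a candidate for review: the same noun can legitimately serve as the object of two distinct senses, so each case still needs a lexicographer's judgment.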

These observations are only preliminary and clearly need further substantiation with more words and more tag analysis. However, they are intriguing, since they support the Firthian maxim. They also support the basic lexicographic notions of lumping and splitting. Finally, the observations give some understanding of the potential sources of difficulty in WSD.
