Semantic Primitives

In a recent posting to CORPORA on the topic of semantic primitives, John Sowa says,

The so-called primitives are the result of analysis by adults who have learned how to write dissertations about language. I believe there are no primitives that are truly primitive in the sense that they cannot be analyzed in different ways by different adults with different biases.

While I won’t argue with John, I do believe such statements can have a discouraging effect on useful research. Throughout the 1970s and 1980s, research on machine-readable dictionaries (MRDs) was quite the rage. However, in 1991, Jean Veronis and Nancy Ide wrote a paper, “An Assessment of Semantic Information Automatically Extracted from Machine Readable Dictionaries.” They concluded that 55 to 70 percent of the data was garbled in some way. This paper had a similarly discouraging effect on MRD research. I have been engaged in MRD research for 40 years and would like to suggest that the search for primitives is not without value.

In general, I agree with John’s assessment of research on primitives. However, I think one major reason for this is that such work tends to be very a priori, i.e., the result of people thinking about what should constitute a set of primitives. I agree that these various efforts are not very convincing. I have been critical of the 1000 words of Basic English and the use of a defining vocabulary of 2000 words for the Longman dictionaries. I have been skeptical about Wierzbicka’s primitives and the WordNet tops because they were not derived from evidence. I am very skeptical about the root nodes in ontologies (all the rage these days).

In late 2001 and early 2002, I was fortunate to do some work for the Oxford Dictionary of English (ODE) in identifying what were called superordinates of noun definitions. At the time, Oxford had developed a noun hierarchy that was based in part on WordNet. I developed routines to parse these definitions and to generate these superordinates. I estimated that my work was about 85 percent correct and I was allowed to proceed. My exercise was a starting point for the Oxford lexicographers, who then followed behind me to correct my work and to use their expertise to fill in what I had not been able to accomplish. I believe the quality of this exercise is now visible in their new online dictionary (which includes large sets of sentences from the Oxford English Corpus). This is certainly not the ultimate in identification of primitives, but it has the significant advantage that it is data-driven, i.e., a posteriori.

In addition to this exercise, I have made frequent use of digraph analysis of sets of definitions in an attempt to find primitives. I have done this for prepositions (preposition classes in the Preposition Project), FrameNet’s frame elements, and 13,000 verbs in the Macquarie dictionary. This is certainly not a be-all and end-all, but it is helpful in attempting to make sense of our wonderful language.
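
For readers who have not seen this kind of analysis, here is a minimal sketch of one way to read "digraph analysis of definitions." It is not the actual routines used in the work above. It assumes each headword points to the words used in its definition, and it treats as candidate primitives those groups of words that are defined only in terms of each other, or not defined at all. The toy definitions and the function names are invented for illustration.

```python
from collections import defaultdict

# Invented toy data: headword -> words used in its definition.
DEFINITIONS = {
    "stroll": ["walk", "slowly"],
    "walk": ["move", "foot"],
    "run": ["move", "fast"],
    "move": ["go"],
    "go": ["move"],
}

def strongly_connected_components(graph):
    """Kosaraju's algorithm; returns a list of sets of words."""
    nodes = set(graph) | {w for words in graph.values() for w in words}
    order, seen = [], set()

    def visit(node):
        # First pass: depth-first search recording finish order.
        seen.add(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in sorted(nodes):
        if node not in seen:
            visit(node)

    # Reverse every edge: defining word -> headwords that use it.
    reverse = defaultdict(list)
    for head, words in graph.items():
        for w in words:
            reverse[w].append(head)

    # Second pass: walk the reversed graph in decreasing finish order.
    components, assigned = [], set()
    for node in reversed(order):
        if node in assigned:
            continue
        component, stack = set(), [node]
        assigned.add(node)
        while stack:
            cur = stack.pop()
            component.add(cur)
            for prev in reverse[cur]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

def candidate_primitives(graph):
    """Components whose definitions never leave the component."""
    result = []
    for comp in strongly_connected_components(graph):
        closed = all(w in comp for node in comp for w in graph.get(node, []))
        if closed:
            result.append(comp)
    return result

if __name__ == "__main__":
    for comp in candidate_primitives(DEFINITIONS):
        print(sorted(comp))
```

On this toy data, the sketch surfaces the mutually defined pair "move" and "go," along with the undefined words "slowly," "foot," and "fast." With a real dictionary, the definitions would first have to be parsed and disambiguated, which is where most of the effort lies.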

In some recent conversations on noun compounds with Robert Amsler (whose verb taxonomy derived from Merriam-Webster’s pocket dictionary set the tone for MRD research at the end of the 1970s), he has talked about how he views the potential of the latest Google corpus for his efforts. Specifically,

The simple rule that we accept in chemistry, that a compound may not exhibit any of the properties of its component elements, doesn’t seem to be understood as essentially true in the lexicon as well.

I think the important point here is that, hopefully in some scientific manner, we can proceed to analyze these sorts of phenomena. I hope we don’t take John’s words to stifle such research.

