A vast array of methods, particularly statistical, is used to categorize information and data. But many qualitative approaches are used as well, particularly in social science research. For the most part, such techniques are based on the investigator's intuitions about the meaning of categories, perhaps supported with statistical analysis. While such approaches can and should be continued, some other avenues have opened up with the developments in linguistics and the semantic theories supporting linguistic theories. This paper presents techniques for category development based on semantic principles (that is, principles for describing the meaning of words), particularly by weaving in the historical emergence of these principles.
To ground this discussion, the paper (1) characterizes some of the ways in which categories are used in social science, from the simple use of categories like gender in questionnaires, through category development in theory development, to highly intricate hierarchical category systems, and (2) looks briefly at category development for thesauruses and library cataloguing systems. The paper then describes the beginnings of computerized information retrieval and text analysis in the 1950s (the early days of computers), particularly from the perspective of their use of thesauruses and cataloguing systems. After providing a brief overview of Minnesota Contextual Content Analysis (MCCA) and lexical resources, the paper unfolds the principles of category development, based on research in linguistic formalisms that has continued with ever richer grammars and semantic formalisms. The progression of these formalisms is described through an examination of the categories used in the MCCA approach. Finally, the paper describes current research toward integrating semantic principles into content analysis, including abstraction procedures for characterizing the "category" of any text.
A survey researcher engaged in exploratory work may ask open-ended questions whose answers can be analyzed only by examining the texts of the responses. The researcher may have initial, sketchy conceptions of the categories into which the answers will fall. In questionnaire development, the researcher formulates questions whose answers should identify a comprehensive set of alternatives (such as lists of items, multiple choices, ranking scales, Likert scales, and range, amount, and frequency intensities) (U. S. General Accounting Office 1993: 46-78). The set of possible answers should contain all the categories, should not overlap, and should have an appropriate level of specificity (U. S. General Accounting Office 1993: 102-9).
Content analysis of open-ended exploratory questions, verbatim transcripts of speech or interviews, or other free textual material is essentially theory development in which an analyst assigns categories to organize the textual material. This development is very difficult, very subjective, and frequently open to criticism on grounds of replicability and interrater reliability. Many investigators eventually create categories featuring particular words, such as emotion-expressing words. The analysis then consists of obtaining the frequencies of such words throughout the textual material. Of course, what constitutes an emotion-expressing word is an important issue. Many content analysts have developed dictionaries, assigning words to different categories based on their individual judgments; these analysts may articulate criteria used for the development of their systems, sometimes stating that the words in a category share "semantic components," that is, common elements of meaning. However, the validity of these category systems can frequently be criticized. This paper describes the use of semantic principles for the development of criteria, with the goal of placing category development on a firmer basis.
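The frequency-based analysis described above can be sketched in a few lines. This is a minimal illustration only: the word-to-category dictionary is a hypothetical toy, not taken from MCCA or from any published content-analysis dictionary, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (word -> category); an assumption made for
# illustration, not an excerpt from any real content-analysis dictionary.
CATEGORY_OF = {
    "happy": "Emotion",
    "angry": "Emotion",
    "afraid": "Emotion",
    "judge": "Sanction",
    "approve": "Sanction",
}


def category_frequencies(text, category_of=CATEGORY_OF):
    """Count how many tokens in `text` fall into each dictionary category."""
    # Lowercase the text and keep alphabetic runs only (a crude tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(category_of[t] for t in tokens if t in category_of)


sample = "I was happy at first, then angry; I did not approve."
counts = category_frequencies(sample)
print(counts)
```

The sketch makes the weakness discussed in the text visible: the result depends entirely on which words the analyst chose to place in each category.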
A thesaurus, while presenting synonyms and antonyms, is generally organized by grouping words according to ideas. Thus, Roget's International Thesaurus (1992) uses 1,073 categories in 15 classes (with further loose groupings within the classes, down finally to pairings of opposites, such as Assent and Dissent). For example, the subclass Sex, with its associated categories Masculinity and Femininity, may be used to formulate gender-based categories.
With the onset of computers, thesauruses and cataloguing systems gained considerable flexibility in permitting multiple terms or categories for characterizing textual materials. The primary purpose of these categories, of course, is for the retrieval of documents. Thesauruses became important adjuncts to cataloguing systems, since documents could be characterized by key words, the stock in trade of thesauruses.
Thesaurus development expanded dramatically with the advent of the computer age in the 1950s. This expansion has continued unabated to the present. The process consists of identifying words and phrases used in documents and then placing them within a thesaurus. Unfortunately, with the rapid expansion of these activities, less attention has been paid to the overarching schemata for a thesaurus. Instead, the emphasis has been on "local" placement decisions in which a new entry is related to other entries, primarily by linkages through synonyms, broader terms, narrower terms, and perhaps antonyms. The overall consistency of the thesaurus is seldom examined. Notwithstanding, available thesauruses of this type are valuable resources for category development.
Within the field of information retrieval, classification of documents is a primary endeavor. A considerable amount of research uses the existence of words in a text as the basis for "classifying" the text, often in relation to other texts and documents. This type of research focuses on the frequency of occurrence of words and uses sophisticated statistical techniques for the classification. While many of these techniques may be useful for category development, it is important to distinguish between classification and category development. The difference is largely one of scale, with classification generally focusing on whole texts (books, reports, and papers), while category development focuses on narrower text segments (individual words, phrases, and sentences). But, as category development attempts to cope with the larger text segments of paragraphs, speeches, and interviews, the boundary with classification begins to blur. The principles described in this paper show how category development may cope with these larger segments and perhaps eventually with classification.
In the MCCA dictionary of 11,000 words, the average number of words in a category is 95, with a range from 1 to about 300 (note 1). Each category is given a name, but these names are only heuristic in nature and have no essential meaning. The categories appear internally consistent in that the words in each category have an underlying similarity. However, the characteristics of the categories are not intuitively obvious. Firm principles for category construction can help extend the MCCA dictionary and improve the function of this program (McTavish, et al. 1997; Litkowski 1997). These principles are a part of the DIMAP dictionary creation and maintenance software (CL Research 1997). DIMAP includes MCCA as a module and improves the dictionary and the function of the technique by creating sublexicons for individual categories. These sublexicons are based on WordNet synsets, information from the Merriam-Webster Concise Electronic Dictionary, as well as the other resources described below.
When the words in a category belong to open classes, analysis becomes a little more difficult. When the words all have the same part of speech (that is, all nouns, all verbs, all adjectives, or all adverbs), a unifying principle is sought from the hierarchical relationships among the words. One possible principle is that the words fall into a small number of categories in a thesaurus such as that of Roget. Another possibility is that the words are related by "broader than," "narrower than," or synonymic relations as assigned in keyword indexing thesauruses. Yet another possible principle is one used for dictionary definitions and consists of examining definitions of the words to identify an umbrella genus word with more specific terms underneath. Using WordNet, this step involves identifying the hierarchical groupings of the words in the category.
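The search for an umbrella genus term can be illustrated with a small sketch. The "is-a" links below are a hypothetical toy hierarchy, not actual WordNet data, and the sketch assumes that each word has a single parent (WordNet itself allows a synset to have more than one hypernym).

```python
# Hypothetical toy "is-a" links (child -> parent); illustrative only and
# not taken from WordNet.
ISA = {
    "professor": "academic",
    "academic": "intellectual",
    "philosopher": "intellectual",
    "intellectual": "person",
    "uncle": "relative",
    "relative": "person",
    "person": "entity",
}


def ancestors(word, isa=ISA):
    """Return the chain from `word` up to its root, starting with `word`."""
    chain = [word]
    while chain[-1] in isa:
        chain.append(isa[chain[-1]])
    return chain


def common_genus(words, isa=ISA):
    """Return the most specific term dominating every word, or None."""
    chains = [ancestors(w, isa) for w in words]
    shared = set(chains[0]).intersection(*chains[1:])
    # Chains run from specific to general, so the first shared term on
    # any one chain is the most specific common genus.
    for term in chains[0]:
        if term in shared:
            return term
    return None


print(common_genus(["professor", "philosopher"]))
print(common_genus(["professor", "uncle"]))
```

A genus that is too general (here, a term as broad as "person") suggests that the word set spans more than one category and should be split.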
The remaining 80 or so categories in MCCA consist primarily of just such open-class words (nouns, verbs, adjectives, and adverbs), sprinkled with closed-class words (auxiliaries, subordinating conjunctions). Several categories consist of words from a single part of speech as is the case with Functional roles, Detached roles, and Human roles, which all include only nouns. To examine such unified sets of words, it is valid to examine their definitions for common genus terms. DIMAP implements the more convenient method of using WordNet to examine hierarchical relations as in Table 1, which shows a sample dictionary entry (note 5) where the field "Isa links" shows that "animal" is of type "creature".
Table 1: Lexical entries: Example of semantic features

Word: #animal  Type=r  Code=#00026  No.Defs=1
  Sense: 1  Cat: nil
    Isa links: #creature d-0
    Features:
      EDIBLE = +boolean

Word: #creature  Type=r  Code=#00025  No.Defs=1
  Sense: 1  Cat: nil
    Isa links: #ind_obj d-0
    Features:
      AGE = +scalar
      SEX = +gender
academic, artist, biologist, creator, critic, historian, instructor, observer, philosopher, physicist, professor, researcher, reviewer, scientist, sociologist.

These words fall under the WordNet synsets headed by person (although not including this word), in particular, synsets headed by:
creator; expert: authority: professional; intellectual.

Other synsets under expert and authority do not fall into this category (and would thus be included in other MCCA categories). Thus, it is possible to characterize Detached roles as words used to describe persons performing intellectual or thinking activities. This is a concept well captured by its heuristic name, and distinguished from Human roles such as uncle or bride and Functional roles such as janitor or firefighter. Identification of these synsets facilitates extension of the MCCA dictionary for this category to include further hyponyms (that is, types of creators, professionals, or intellectuals) of these synsets.
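Extending a category by collecting the hyponyms under its head synsets might look like the following sketch. The hyponym links are a hypothetical toy hierarchy rather than WordNet data; the three heads (creator, professional, intellectual) follow the text, while expert and authority are deliberately not used as heads because their other hyponyms belong to other categories.

```python
# Hypothetical toy hyponym links (parent -> children); illustrative only
# and not taken from WordNet.
HYPONYMS = {
    "person": ["creator", "expert", "intellectual", "relative"],
    "creator": ["artist"],
    "expert": ["authority"],
    "authority": ["professional"],
    "professional": ["professor", "critic"],
    "intellectual": ["philosopher", "scientist"],
    "scientist": ["physicist", "biologist"],
    "relative": ["uncle"],
}


def subtree(head, hyponyms=HYPONYMS):
    """Return `head` together with every term below it in the hierarchy."""
    found, stack = set(), [head]
    while stack:
        term = stack.pop()
        if term not in found:
            found.add(term)
            stack.extend(hyponyms.get(term, []))
    return found


def extend_category(heads, hyponyms=HYPONYMS):
    """Return the union of the subtrees rooted at each head term."""
    members = set()
    for head in heads:
        members |= subtree(head, hyponyms)
    return members


detached_roles = extend_category(["creator", "professional", "intellectual"])
print(sorted(detached_roles))
```

Because the heads are chosen below expert and authority, terms such as uncle (a Human role in the text) and the intermediate nodes expert and authority stay outside the extended category.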
Laffal (1995) likewise based his dictionary of 43,000 words and 168 concepts on semantic features, coding words in the same category based on the "core meanings of words," that is, on their having the same semantic component. Nida (1975: 174) characterized a semantic domain as consisting of words sharing semantic components. However, he also suggested (Nida 1975: 193) that domains represent an arbitrary grouping of the underlying semantic features.
Thus, we see that the 1960s development of the notion of semantic features has become a very prominent basis for the development of category systems. The subtrees rooted at particular nodes in the WordNet hierarchies provide a readily available basis for category development that reflects (implicit) assignment of common semantic features and components. Litkowski (1997) proposes making these semantic features and components more explicit, specifically for the purpose of facilitating category development.
Table 2: Lexical entries: Example of syntax and semantic roles

Word: eat  Type=r  Code=e00000  No.Defs=1
  Sense: 1  Cat: vrb
    Defin: ingest solid food through mouth and swallow it
    Isa links: #ingest d-0
    Features:
      root = $var0
      subj = ((root $var1) (cat n))
      obj = ((root $var2 optional) (cat n))
      AGENT = ^$var1
      THEME = ^$var2
applaud, applause, approve, congratulate, congratulation, convict, conviction, disapproval, disapprove, honor, judge, judgment, judgmental, merit, mistreat, reject, rejection, ridicule, sanction, scorn, scornful, shame, shamefully.

While this set of words includes words from several parts of speech (discussed in more detail below), it is rooted primarily in the Levin (1993) verb sets of Characterize (class 29.2), Declare (29.4), Admire (31.2), and Judgment (33). This means that the set has particular syntactic and semantic patterning in addition to the synonymic and hierarchical relations that can be discovered using the techniques described in the previous section. Levin has identified a considerable set of syntactic properties associated with the classes she has developed (and thus a useful resource itself for category development), but has not yet formally characterized the semantic properties. Instead, the definition of this class might, following Davis (1996), inherit a sort notion-rel, which has a "perceiver" and a "perceived" argument (thus capturing syntactic patterning) with perhaps a selectional restriction on the "perceiver" that the type of action is an evaluative one (thus providing semantic patterning). In other words, the underlying conceptualization of the MCCA category indicates that there is an action involved (as indicated by the verb), that this action involves some idea or notion on the part of the actor (the "perceiver"), and that this notion (the "perceived") is inherently an evaluation.
WordNet synsets explicitly contain some syntactic information and implicitly some semantic role information. However, WordNet does not have the depth required for the analysis described above. Other resources, such as Levin (1993), as well as some databases being constructed for on-line access, contain more of this detail. What this means for purposes of characterizing and extending the words in the category Sanction is that not only can the WordNet hierarchy be used, but it is also possible to include words that correspond to the conversion of verb concepts into noun counterparts (for example, the action judge corresponds to the result of a judging action, that is, a judgment).
As alluded to in the last section, a restriction was placed on the type of notion involved in the use of a word in the Sanction category, namely, that it had to be evaluative in nature. Table 3 shows a lexical entry for the preposition in with two senses. Basically, this entry says that in is used to begin prepositional phrases (the "pp-adjunct") with noun phrase objects. In the first sense, this says that the phrase may be attached to another noun phrase which may be an "object" or an "event" and that the object of the prepositional phrase is a location in some physical object. The second sense says that the prepositional phrase is attached to a verb which describes an event and that the object of the preposition describes a location which may additionally be characterized as a destination. These specifications are called selectional restrictions and serve to limit the range of words that may appear in the identified syntactic positions.
Table 3: Lexical entries: Example of selectional restrictions

Word: in  Type=r  Code=i00000  No.Defs=2
  Sense: 1  Cat: prp
    Defin: located within the confines of
    Features:
      root = $var1
      pp-adjunct = ((root $var0) (obj ((root $var2) (cat n))))
      ^$var1 = (*OR* +object +event)
      (location ^$var2 +physobj)
  Sense: 2  Cat: prp
    Defin: into the destination of
    Features:
      root = $var1
      pp-adjunct = ((root $var0) (obj ((root $var2) (cat n))))
      ^$var1 = +event
      (destination ^$var2 +location (relaxable-to +physobj))
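How selectional restrictions limit the available senses can be sketched as a type check. The sketch below is a simplified rendering of the two senses of "in" from Table 3, not DIMAP's own evaluation procedure; the type hierarchy and the example types (box, city, arrival) are hypothetical, and the "relaxable-to" clause is modeled simply as an additional permitted type.

```python
# Hypothetical toy type hierarchy (type -> parent type); an assumption
# made for illustration, not DIMAP's ontology.
TYPE_PARENT = {
    "physobj": "object",
    "box": "physobj",
    "city": "location",
    "arrival": "event",
}


def is_a(sem_type, target):
    """True if `sem_type` equals `target` or lies below it in the hierarchy."""
    while sem_type is not None:
        if sem_type == target:
            return True
        sem_type = TYPE_PARENT.get(sem_type)
    return False


# Simplified restrictions for the two senses of "in" in Table 3.
# "head" is what the prepositional phrase attaches to; "obj" is the
# object of the preposition.
IN_SENSES = {
    1: {"head": ["object", "event"], "obj": ["physobj"]},
    # Sense 2 wants a location but is relaxable to a physical object.
    2: {"head": ["event"], "obj": ["location", "physobj"]},
}


def admissible_senses(head_type, obj_type, senses=IN_SENSES):
    """Return the sense numbers whose restrictions both arguments satisfy."""
    return [
        number
        for number, restriction in senses.items()
        if any(is_a(head_type, t) for t in restriction["head"])
        and any(is_a(obj_type, t) for t in restriction["obj"])
    ]


print(admissible_senses("box", "box"))       # noun attachment, physical object
print(admissible_senses("arrival", "city"))  # event attachment, location
```

When more than one sense remains admissible, the restrictions alone do not disambiguate, and other information in the lexical entries would be needed.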
Table 4: Lexical entries: Example of semantic relations

Word: #event  Type=r  Code=#00012  No.Defs=1
  Sense: 1  Cat: nil
    Isa links: #all d-0
    Features:
      SUBEVENTS = +event
      SUBEVENT-OF = +event
      TIME = > 0 (MEASURING-UNIT +second)
      LOCATION = +place
      CAUSED-BY = +event
      CAUSES = +event
      PRECONDITION = +event
      EFFECT = +event
Table 5: Lexical entries: Example of knowledge base data

Word: #teach  Type=r  Code=#00014
    Isa links: #communicative-event d-0
    Features:
      AGENT = +intentional-agent (default +teacher)
      THEME = +knowledge
      BENEFICIARY = +intentional-agent (default +student)
      PRECONDITION = (default (*AND* #teach-know-1 (NOT #teach-know-2)))
      EFFECT = (default #teach-know-2)
      SUBEVENTS = (*AND* #teach-describe #teach-request-info #teach-answer)

Word: #teach-answer  Type=r  Code=#00019
    Isa links: #answer d-0
    Features:
      AGENT = +teach.agent
      THEME = +teach-request-info.theme
      BENEFICIARY = +teach.beneficiary

Word: #teach-describe  Type=r  Code=#00017
    Isa links: #describe d-0
    Features:
      AGENT = +teach.agent
      THEME = +teach.theme
      BENEFICIARY = +teach.beneficiary
Several more elaborate forms of relations are also possible. For the purpose of illustrating these additional derivational rules, we introduce another MCCA category, known as Normative. This is a complex category consisting of 76 words, and like the Sanction category, also has words from all parts of speech. This category includes the following (along with various inflectional forms):
absolute, absolutely, consequent, consequence, consequently, correct, correctly, dogmatism, habit, habitual, habitually, ideologically, ideology, necessarily, necessary, norm, obviously, prominence, prominent, prominently, regular, regularity, regularly, unequivocally, unusual, unusually

The use of the heuristic Normative to label this category clearly reflects the presence in these words of a semantic component oriented around characterizing something in terms of expectations or standards. Of particular interest here are the derivational relations that form adjectives from nouns, nouns from adjectives, and adverbs from adjectives. There were similar kinds of relations in the Sanction category, where most of the concepts seemed to be based on underlying verb forms. In that category, a number of words were clearly noun, adjective, and adverb derivations from the underlying verbs.
These derivational relations can be encoded in lexical entries in the same way as the semantic relations shown in Table 4. The feature name in such relations would describe the relation (such as "nominalization") with a value identifying the derived form, which would also be a lexical entry having the inverse relation ("nominalization_of"), with a value showing the base form of the word. Some of these relations are shown in WordNet, but a more complete source is a dictionary which shows an ordering of derived forms. The MRD included with DIMAP shows these forms.
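A minimal sketch of this encoding follows. The relation names "nominalization" and "nominalization_of" come from the text; the dictionary layout, the helper names, and the relation name "adjectivalization" are assumptions introduced for illustration and do not reflect DIMAP's storage format.

```python
def add_derivation(lexicon, base, relation, derived):
    """Record `relation` on the base entry and its inverse on the derived entry."""
    lexicon.setdefault(base, {}).setdefault(relation, set()).add(derived)
    lexicon.setdefault(derived, {}).setdefault(relation + "_of", set()).add(base)


def derivational_family(lexicon, word):
    """Return every word reachable from `word` through derivational links."""
    seen, stack = set(), [word]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Follow both forward and inverse relations.
        for targets in lexicon.get(current, {}).values():
            stack.extend(targets)
    return seen


lexicon = {}
add_derivation(lexicon, "judge", "nominalization", "judgment")
# "adjectivalization" is a hypothetical relation name.
add_derivation(lexicon, "judgment", "adjectivalization", "judgmental")

print(lexicon["judgment"])
print(sorted(derivational_family(lexicon, "judgmental")))
```

Storing the inverse alongside each relation means that a category containing any one member of a derivational family can be extended to the rest, regardless of which form is treated as the base.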
The adverb derivations in the Normative category have an additional interesting aspect to them. The heuristic Reasoning has also been used to label this category. When we examine the syntactic and semantic nature of these adverbs, we find that they are considered to be content disjuncts (Quirk, et al. 1985: 8.127-33), that is, words indicating that the speaker is making a comment on the content of what the speaker is saying, in this case, compared to some norm or standard. Thus, one of the defining characteristics for this category is a specification for lexical items that have a [content-disjunct +] feature. In analyzing text that contains such words as necessarily, obviously, unequivocally, and consequently, we would thus be able to recognize the presence of editorial commentary. This indicates the value of using non-database sources that describe syntactic and semantic characteristics of the language.
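Recognizing editorial commentary from a [content-disjunct +] feature could be sketched as below. The four adverbs marked positive are those named in the text; the lexicon layout, the negative marking on "regularly," and the sentence splitter are assumptions for illustration. The sketch also ignores syntactic position, although whether an adverb actually functions as a disjunct in a given sentence depends on it.

```python
import re

# Hypothetical mini-lexicon; the feature layout is an assumption, and the
# negative marking on "regularly" is illustrative only.
LEXICON = {
    "necessarily": {"cat": "adv", "content-disjunct": True},
    "obviously": {"cat": "adv", "content-disjunct": True},
    "unequivocally": {"cat": "adv", "content-disjunct": True},
    "consequently": {"cat": "adv", "content-disjunct": True},
    "regularly": {"cat": "adv", "content-disjunct": False},
}


def sentences_with_commentary(text, lexicon=LEXICON):
    """Return (sentence, disjuncts) pairs for sentences containing a disjunct."""
    # Crude sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = [t for t in tokens if lexicon.get(t, {}).get("content-disjunct")]
        if hits:
            flagged.append((sentence, hits))
    return flagged


text = "Obviously, the plan failed. They meet regularly."
print(sentences_with_commentary(text))
```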
The final type of lexical rule considered here is more subtle and involves the observation that a word may have several senses that are related to one another (usually with one sense as the base from which all the others have been derived). A simple example of such a rule is the word "fish." The base sense of this word refers to an individuated object that is countable; the derived sense refers to food, where the object is not individuated but is an undifferentiated mass or substance. Another example of the same process is the use of the word "coffee." A lexical rule has been developed to encode this regularity in language and is shown as a lexical entry in Table 6. Note that there is a general rule of "grinding" and then a more detailed entry for "animal-grinding." For the more general rule, a count noun is converted into a mass noun, taking it from an individuated object to a substance. In the more specific rule, the count noun is required to be an animal and then the derived form is a food-substance. Table 7 shows how this might be encoded in a dictionary entry for the word "coffee," where sense 2 of the word is derived from sense 1.
Table 6: Lexical entries: Example of lexical rules

Word: #grinding  Type=r  Code=#00032
  Sense: 1  Cat: nil
    Isa links: #lexical-rule
    Features:
      0 = +count-noun (ORTH $var0) (RQS +ind_obj)
      1 = +mass-noun (ORTH $var0) (RQS +substance)

Word: #animal-grinding  Type=r  Code=#00033
  Sense: 1  Cat: nil
    Isa links: #grinding
    Features:
      0 = (RQS +animal)
      1 = (RQS +food-substance)
Table 7: Lexical entries: Example of sense relations

Word: coffee  Type=r  Code=c00000  No.Defs=2
  Sense: 1  Cat: nou
    Defin: a kind of bean which is roasted and ground to produce coffee-2
    Isa links: #coffee-bean d-0
    Features:
      count = +
      proper = -
  Sense: 2  Cat: nou
    Defin: a hot drink made from coffee-1
    Features:
      count = -
      proper = -
    Role: #grinding coffee(1)
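The grinding rules of Table 6 can be sketched as functions from one sense to another. This is an illustrative reading of the table, not DIMAP's rule mechanism: the class names, the `apply` method, and the small type hierarchy are assumptions, although the links from animal through creature to ind_obj mirror Table 1.

```python
from dataclasses import dataclass
from typing import Optional

# Toy type hierarchy (type -> parent). The animal/creature/ind_obj links
# mirror Table 1; the food-substance link is an assumption.
TYPE_PARENT = {
    "animal": "creature",
    "creature": "ind_obj",
    "food-substance": "substance",
}


def is_a(sem_type, target):
    """True if `sem_type` equals `target` or lies below it in the hierarchy."""
    while sem_type is not None:
        if sem_type == target:
            return True
        sem_type = TYPE_PARENT.get(sem_type)
    return False


@dataclass(frozen=True)
class Sense:
    orth: str        # spelling, preserved by the rule (ORTH $var0)
    countable: bool  # True for a count noun, False for a mass noun
    sem_type: str    # semantic type of the referent (RQS)


@dataclass(frozen=True)
class LexicalRule:
    name: str
    input_type: str
    output_type: str

    def apply(self, sense: Sense) -> Optional[Sense]:
        """Derive a mass-noun sense, or return None if the rule does not apply."""
        if not sense.countable or not is_a(sense.sem_type, self.input_type):
            return None
        return Sense(sense.orth, False, self.output_type)


GRINDING = LexicalRule("grinding", "ind_obj", "substance")
ANIMAL_GRINDING = LexicalRule("animal-grinding", "animal", "food-substance")

fish = Sense("fish", True, "animal")
print(ANIMAL_GRINDING.apply(fish))
print(GRINDING.apply(fish))
```

The more specific rule yields the more informative result (a food-substance rather than a bare substance), which is one reason to arrange the rules in an inheritance hierarchy as Table 6 does.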
Burstein, et al. (1996) describe techniques for using lexical semantics (that is, using the information described in lexical entries) to classify responses to test questions. An essential component of this classification process is the identification of sublexicons that cut across parts of speech, along with concept grammars that allow the collapsing of phrases and clauses into a generalized representation that abstracts away from the reliance on individual words. As seen above in the procedures for defining MCCA categories, addition of lexical semantic information in the form of derivational and morphological relations (that is, word formation rules) and semantic components common across part of speech boundaries would facilitate the development of concept grammars.
Litkowski & Harris (1997) discuss extension of a discourse analysis algorithm incorporating lexical cohesion principles. These principles show how the information in lexical entries, particularly selectional specifications on verbs, maintains the lexical cohesion of a discourse. With such information, it is possible to understand how the individual components of a text fit together and, in particular, to show that particular phrases and sentences are elaborations of others (and hence not an essential part of the text's categorization). As a result, it is possible not only to provide a more coherent discourse analysis of a text segment, but also possibly to summarize the text better and thus provide an overall categorization of a text, rather than just a classification.
Burstein, J., Kaplan, R., Wolff, S., & Lu, C. (1996). Using lexical semantic information techniques to classify free responses. In E. Viegas & M. Palmer (Eds.), Breadth and Depth of Semantic Lexicons. Workshop Sponsored by the Special Interest Group on the Lexicon. Santa Cruz, CA: Association for Computational Linguistics.
CL Research. (1997 - in preparation). DIMAP-3 users manual. Gaithersburg, MD.
Davis, A. R. (1996). Lexical semantics and linking in the hierarchical lexicon [diss], Stanford, CA: Stanford University.
Evens, M., Litowitz, B., Markowitz, J., Smith, R., & Werner, O. (1980). Lexical-semantic relations: A comparative survey. Edmonton, Alberta: Linguistic Research, Inc.
Fillmore, C. J. (1968). The case for case. In E. Bach & R. Harms (Eds.), Universals in linguistic theory (pp. 1-90). New York: Holt, Rinehart, and Winston.
Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39, 170-210.
Laffal, J. (1995, October). A concept analysis of Jonathan Swift's A Tale of a Tub and Gulliver's Travels. Computers and the Humanities, pp. 339-361.
Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago, IL: The University of Chicago Press.
Litkowski, K. C. (1997). Desiderata for tagging with WordNet synsets and MCCA categories. 4th Meeting of the ACL Special Interest Group on the Lexicon. Washington, DC: Association for Computational Linguistics. (Available at CL Research)
Litkowski, K. C., & Harris, M. D. (1997). Category development using complete semantic networks. Technical Report, vol. 97-01. Gaithersburg, MD: CL Research. (Available at CL Research)
Markowitz, J., Ahlswede, T., & Evens, M. (1986, June 10-13). Semantically Significant Patterns in Dictionary Definitions. 24th Annual Meeting of the Association for Computational Linguistics. New York, NY: Association for Computational Linguistics.
McTavish, D. G. (1997b). Scale Validity: A Computer Content Analysis Approach. Social Science Computer Review, to appear.
McTavish, D. G., Litkowski, K. C., & Schrader, S. (1997a). A computer content analysis approach to measuring social distance in residential organizations for older people. Social Science Computer Review, 15(2), 170-180. (An earlier version available at CL Research)
McTavish, D. G., & Pirro, E. B. (1990). Contextual content analysis. Quality & Quantity, 24, 245-265. (Available at CL Research)
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.
Nida, E. A. (1975). Componential analysis of meaning. The Hague: Mouton.
Roget's International Thesaurus (R. L. Chapman, Ed.) (5th). (1992). New York: HarperCollins Publishers, Inc.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. London: Longman.
U. S. General Accounting Office. (October 1993). Developing and using questionnaires. GAO/PEMD-10.1.7. Washington, D.C.
Schank, R. C. (1975). Conceptual information processing. Amsterdam: North-Holland.
Schank, R. C., & Abelson, R. (1977). Scripts, plans, goals and understanding. Hillsdale, NJ: Lawrence Erlbaum.
Whissell, C. (1996). Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities, 30(3), 257-265.
Winograd, T. (1972). Understanding natural language. New York: Academic Press.
2. A lexicon includes phrases as well as individual words. A phrase in a lexicon has the same conceptual status as a word and hence can be characterized in the same way as a word. Recognizing phrases in text analysis is very difficult. Since this paper is not concerned with the actual mechanics of text analysis, use of the term phrases is avoided for the sake of simplicity of presentation.
3. Described also on the World Wide Web at http://www.cogsci.princeton.edu/~wn/, from which the database may be downloaded.
4. Closed classes are syntactic categories, such as prepositions or pronouns, that have relatively few words and are unlikely to have new words. Open classes are nouns, verbs, adjectives, and adverbs; these classes expand as the language evolves.
5. All lexical entries shown in the tables were created directly from DIMAP entries in the exact format shown. DIMAP allows the user to specify which parts of an entry are to be extracted and in what format.
6. The Special Interest Group on the Lexicon of the Association for Computational Linguistics maintains a set of links to publicly available lexical resources on the World Wide Web at http://www.clres.com/siglex.html.