The Dictionary Parsing Project: Steps Toward a Lexicographer's Workstation

Kenneth C. Litkowski

CL Research

9208 Gue Road

Damascus, MD 20872

ken@clres.com

http://www.clres.com

The Dictionary Parsing Project (DPP) (CL Research, 1999) is intended to create publicly available semantic networks and ontologies based on parsing dictionary definitions. The DPP has drawn components and suggestions from CL Research, the USC Information Sciences Institute (ISI), Micra Inc., and Franklin Technology Inc. This demonstration focuses on the use of CL Research's DIMAP (DIctionary MAintenance Programs) and on the steps involved in creating and using the semantic network from creation of a dictionary through use of the semantic network in word-sense disambiguation. While the steps can be seen as aids to a lexicographer developing dictionary definitions, many of them have immediate application in natural language processing. The demonstration particularly identifies how lexicon development processes translate into lexical elements used in parsing.

The DPP consists of three main components: a large machine-readable dictionary (MRD), a near industrial-strength parser, and software capable of analyzing parse results to build a semantic network. The MRD is the publicly available Webster's Revised Unabridged Dictionary (1913) (hereafter WR), transcribed and made available by Pat Cassidy of Micra, Inc.; it includes 110,000 head words and 270,000 definitions. (The techniques employed have also been applied to other dictionaries, including Webster's Third International Dictionary (1966), WordNet (Miller et al., 1990), and The American Heritage Dictionary of the English Language (1992).) A subset of the data from WR has been extracted, using Perl programs developed by Bruce Jakeway and Ed Hovy of ISI, into a form suitable for more detailed analysis: recognition of ontological elements (Information Sciences Institute, 1998) and parsing using DIMAP. The data were then transformed into a format for uploading into DIMAP dictionaries, which provide a general data structure suitable for a wide range of notational capabilities. (All these data are publicly available.)

The parser, which processes 400 definitions per minute on a 266 MHz Pentium II with 64 MB RAM, was developed and provided by Ned Irons of Franklin Technology, Inc. (Irons, 1999). The parser is an ATN-style grammar of 350 productions. Each production consists of a start state, a condition to be satisfied (either a non-terminal or a lexical category), and an end state. When a condition is satisfied, an "action" program makes further annotations or grows the parse tree. One distinctive feature of the parser is that, based on the input word (e.g., a verb), "dynamic" parsing goals are added to test for subcategorization patterns. The output of the parser takes the form of bracketed parse trees, with constituents down to leaf nodes consisting of the part of speech and the lexical entry giving characteristics of the parsed lemma. Annotations, such as number and tense information, may be included at any node. The dictionary used by the parser (40,000 words, easily extensible) is based on the Oxford Advanced Learner's Dictionary (1989) and contains several characteristics of words beyond their part of speech, including the verb subcategorization patterns used in the dynamic parsing. (Source code in C and documentation are available. The code compiles under VC 6.0, Borland CBuilder3, Sun4, Linux, and BSD UNIX.)
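
The production format described above (start state, condition, end state) can be illustrated with a minimal sketch in Python. This is not the Franklin parser: the toy noun-phrase grammar, the state names, and the function accepts are hypothetical, conditions are limited to lexical categories, and the "action" programs and "dynamic" goals are omitted.

```python
from typing import NamedTuple

class Production(NamedTuple):
    start: str      # state the arc leaves
    condition: str  # lexical category that must match the next input item
    end: str        # state the arc enters

# Hypothetical toy network for a simple noun phrase:
# an optional determiner, any number of adjectives, then a noun.
PRODUCTIONS = [
    Production("NP0", "det", "NP1"),
    Production("NP0", "adj", "NP1"),
    Production("NP0", "noun", "NP2"),
    Production("NP1", "adj", "NP1"),
    Production("NP1", "noun", "NP2"),
]
FINAL_STATES = {"NP2"}

def accepts(categories, state="NP0"):
    """Return True if the category sequence leads from state to a final state."""
    if not categories:
        return state in FINAL_STATES
    head, rest = categories[0], categories[1:]
    # Try every production that leaves the current state and whose
    # condition is satisfied by the next category.
    return any(
        accepts(rest, p.end)
        for p in PRODUCTIONS
        if p.start == state and p.condition == head
    )
```

Under these assumptions, accepts(["det", "adj", "noun"]) succeeds, while accepts(["det"]) fails because no final state is reached.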

Once definitions are available in DIMAP, they are parsed by placing the definitions in sentence frames based on the characteristics of the definition. The DIMAP functionality contains several specialized routines for examining and analyzing the parse results, the most notable of which is a capability for creating annotated regular expression "defining patterns" that are compiled and integrated into the parsing dictionary for seamless identification of semantic relations. For example, the following recognizes manner relations from prepositional phrases beginning with the word in:

in(dpat((~ rep01(det(0)) adj manner(0) sr(manner)))).

with the adjective identified as being in a manner relation to the head word for the definition that is parsed.
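
The effect of the pattern above can be approximated in Python over part-of-speech-tagged tokens. This sketch rests on our reading of the pattern (an optional determiner, an adjective, then the word "manner", all following "in"); it does not reproduce the DIMAP pattern compiler, and the function name, tag names, and sample definitions are hypothetical.

```python
def manner_relations(tagged):
    """Return ('manner', adjective) for each 'in [det] ADJ manner' sequence.

    tagged is a list of (word, part_of_speech) pairs.
    """
    found = []
    for i, (word, _) in enumerate(tagged):
        if word.lower() != "in":
            continue
        j = i + 1
        # Optional determiner (assumed meaning of rep01(det(0))).
        if j < len(tagged) and tagged[j][1] == "det":
            j += 1
        # An adjective immediately followed by the literal word "manner".
        if (j + 1 < len(tagged)
                and tagged[j][1] == "adj"
                and tagged[j + 1][0].lower() == "manner"):
            found.append(("manner", tagged[j][0]))
    return found

# Hypothetical tagged definition fragment: "to move in a hasty manner"
sample = [("to", "inf"), ("move", "verb"), ("in", "prep"),
          ("a", "det"), ("hasty", "adj"), ("manner", "noun")]
print(manner_relations(sample))
```

Here the adjective "hasty" is returned as the value of the manner relation, which would then be attached to the head word of the definition being parsed.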

Once identified, these relations are integrated into fields of the underlying DIMAP dictionary, where they become available for further analysis and for use in parsing free text. (The DPP is working with several inventories of semantic relations (CL Research, 1999); the semantic relations and networks created in the DPP have considerable similarity to those established in Microsoft's MindNet (Microsoft, 1999). Currently, we have identified 234,000 relations in WR, or 0.87 per definition; this compares with 3.26 relations per definition in MindNet.)

The lexicographical functionality allows searching the dictionary entries using regular expressions (e.g., definitions containing "which" followed by a word ending in "s" or all senses containing a manner component) in any of the fields used for displaying senses; this facilitates examination of particular defining patterns. After parsing dictionary entries and creating additional data in entries for the semantic relations, the lexicographer can then compare and map entries between dictionaries using word overlap analysis or componential analysis techniques (Litkowski, 1999).
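
The two operations just described can be sketched in Python. The regular expression implements the "which" example from the text; the overlap measure is a Jaccard coefficient over content words, which we substitute here for illustration and which is not necessarily the measure used in Litkowski (1999). The stopword list, function names, and sample definitions are hypothetical.

```python
import re

# "which" followed by a word ending in "s", as in the example above.
WHICH_S = re.compile(r"\bwhich\s+\w+s\b", re.IGNORECASE)

def search_definitions(entries, pattern):
    """Return the (headword, definition) pairs whose definition matches."""
    return [(head, text) for head, text in entries if pattern.search(text)]

# A small illustrative stopword list (an assumption, not DIMAP's).
STOPWORDS = {"a", "an", "the", "of", "or", "to", "in", "that", "which"}

def content_words(definition):
    """Lowercased alphabetic tokens of a definition, minus stopwords."""
    words = re.findall(r"[a-z]+", definition.lower())
    return {w for w in words if w not in STOPWORDS}

def word_overlap(def_a, def_b):
    """Jaccard overlap of the content words of two definitions."""
    a, b = content_words(def_a), content_words(def_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical entries, for illustration only.
entries = [("hinge", "that on which things turn"),
           ("pace", "a step taken in walking")]
print(search_definitions(entries, WHICH_S))
print(word_overlap("a step taken in walking", "a single step in walking"))
```

A mapping between two dictionaries could, under this sketch, pair each sense with the sense in the other dictionary that has the highest overlap score.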



After adding semantic relations or syntactic specifications to the dictionary entries, the entries can then be tested against corpora of texts providing sample usages of a word, as used in Senseval (Kilgarriff, 1998). DIMAP parses unrestricted text, but is particularly suited to testing the effectiveness of syntactic, semantic, and collocational specifications of a word against a tagged or untagged corpus. DIMAP performs word-sense disambiguation, which allows the lexicographer to determine how well the components of a word's senses separate usages into the different sense classes; this facilitates the process of improving disambiguation techniques. The results of this parsing can be evaluated against a "gold standard" (such as that supplied in Senseval) and used for further development of definition parsing techniques (including development of the "defining patterns" used to recognize semantic relations), which can in turn be used iteratively to assess improvement in the parsing of the "gold standard" texts.
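
The evaluation step can be sketched as a comparison of assigned sense tags against gold-standard tags. The following Python fragment assumes simple exact-match scoring with one sense per instance; it does not reproduce the Senseval scoring scheme, and the function name, instance identifiers, and sense labels are hypothetical.

```python
def score(predicted, gold):
    """Return (precision, recall) of predicted sense tags against gold tags.

    Both arguments map an instance identifier to a sense label; an
    instance missing from predicted counts as not attempted.
    """
    attempted = [i for i in gold if predicted.get(i) is not None]
    correct = sum(1 for i in attempted if predicted[i] == gold[i])
    # Precision is measured over attempted instances, recall over all.
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical tags for four usages of one word.
gold = {"u1": "s1", "u2": "s2", "u3": "s1", "u4": "s3"}
predicted = {"u1": "s1", "u2": "s1", "u3": "s1"}
precision, recall = score(predicted, gold)
```

Rerunning such a comparison after each revision of the defining patterns or sense specifications gives the iterative measure of improvement described above.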

In summary, the Dictionary Parsing Project provides a testbed that can be used not only for creating a semantic network or ontology of concepts, but also for developing techniques that can be applied immediately in parsing unrestricted text, identifying a semantic representation of the texts, and assessing word-sense disambiguation techniques. Even though the project is in its early phases, it has already experienced demonstrable success. The inherent bootstrapping character of the steps that have been laid out promises further improvements.

References

The American Heritage Dictionary of the English Language (A. Soukhanov, Ed.) (3rd). (1992). Boston, MA: Houghton Mifflin Company.

CL Research. (1999). Dictionary Parsing Project (CL Research). Available: http://www.clres.com/dpp.html

Information Sciences Institute. (1998). Dictionary Parsing Project. Available: http://www.isi.edu/natural-language/dpp/

Irons, N. (1999). Franklin Sentence Parsing Program. Available: http://proximity.franklin.com/Parse.htm

Kilgarriff, A. (1998). SENSEVAL Home Page. Available: http://www.itri.bton.ac.uk/events/senseval/

Litkowski, K. C. (1999, 21-22 June). Towards a Meaning-Full Comparison of Lexical Resources. Association for Computational Linguistics Special Interest Group on the Lexicon Workshop. College Park, MD.

Microsoft. (1999). About Microsoft Research: Technical Reports (MindNet, Vanderwende, Dolan, Richardson). Available: http://research.microsoft.com/pubs/

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.

Oxford Advanced Learner's Dictionary (4th). (1989).

Webster's Revised Unabridged Dictionary (N. Porter, Ed.). (1913). G & C. Merriam Co.

Webster's Third International Dictionary. (1966). Chicago: Encyclopaedia Britannica.