Oxford University Press-CL Research Collaboration

Contents of NODE and NODE+DIMAP

Machine-Readable Version (MRD) of the New Oxford Dictionary of English (NODE)

The MRD version of NODE comprises 46.6 MB of data, marked up with tags. The distribution includes a full description of these tags. Each entry is marked as either standard or encyclopedic and is divided into five main blocks: headword, part-of-speech/definitions, phrases, derivations, and phrasal verbs.

Headword Block
Contains the headword (which may be a multiword unit or an encyclopedic name), and may contain a homograph identifier, subject and/or specialist labels, pronunciations, variants, and register and geographic labels.
Part of Speech/Definition Block
Contains a part of speech and a variable number of definitions associated with that part of speech, organized into core senses, subsenses, and technical or additional information. Includes inflections and variants appropriate to the part of speech or to the sense or subsense. May include grammatical characterizations for nouns (mass, count, attributive), verbs (with or without object, with adverbial), adjectives (attributive, predicative, postpositive), and adverbs (sentence, submodifier), as well as the (inflected) form in which a specific sense is used. May also include register, subject and/or specialist labels, and geographic labels specific to the sense or subsense. Includes illustrative examples for each sense or subsense, taken from the British National Corpus or the Oxford Reading Programme; many of these examples also identify collocations in which the headword is used.
Phrase Block
Contains idioms and phrases in which the headword commonly appears, often with phrasal variations. These phrases are fully defined, following the format of the definition block.
Derivation Block
Contains undefined derivations of the headword, along with their parts of speech.
Phrasal Verb Block
Contains verb phrases that begin with a verb headword. These phrases are fully defined, following the format of the definition block.
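
The five-block entry layout described above can be pictured as a small data model. The following is only a sketch under stated assumptions: the class and field names (`Entry`, `PosSection`, `Sense`, and so on) are hypothetical and do not correspond to the actual NODE tag set, and the sample strings are placeholders, not text from NODE.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    """One core sense or subsense (hypothetical structure)."""
    definition: str
    grammar: Optional[str] = None                      # e.g. "mass noun"
    labels: List[str] = field(default_factory=list)    # register, subject, geographic
    examples: List[str] = field(default_factory=list)
    subsenses: List["Sense"] = field(default_factory=list)


@dataclass
class PosSection:
    """A part of speech with its definitions."""
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)


@dataclass
class DefinedPhrase:
    """A phrase or phrasal verb, defined in the same format as a sense."""
    text: str
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Derivation:
    """An undefined derivation: a form plus its part of speech."""
    form: str
    part_of_speech: str


@dataclass
class Entry:
    headword: str
    kind: str = "standard"                             # or "encyclopedic"
    homograph: Optional[int] = None
    variants: List[str] = field(default_factory=list)
    pos_sections: List[PosSection] = field(default_factory=list)
    phrases: List[DefinedPhrase] = field(default_factory=list)
    derivations: List[Derivation] = field(default_factory=list)
    phrasal_verbs: List[DefinedPhrase] = field(default_factory=list)


def populated_blocks(entry: Entry) -> List[str]:
    """List which of the five main blocks carry data for this entry."""
    blocks = ["headword"]                              # always present
    if entry.pos_sections:
        blocks.append("part-of-speech/definitions")
    if entry.phrases:
        blocks.append("phrases")
    if entry.derivations:
        blocks.append("derivations")
    if entry.phrasal_verbs:
        blocks.append("phrasal verbs")
    return blocks


# Placeholder entry for illustration only; none of this text is from NODE.
sample = Entry(
    headword="example",
    pos_sections=[PosSection("noun", [Sense("placeholder definition",
                                            grammar="count noun",
                                            examples=["placeholder example"])])],
    phrases=[DefinedPhrase("placeholder phrase", [Sense("placeholder definition")])],
)
print(populated_blocks(sample))
```

Phrases and phrasal verbs share one class here because the text says both follow the format of the definition block, while derivations stay undefined.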

CL Research and DIMAP Enhancements of NODE

The CL Research enhancements of NODE consist of (1) the conversion program used to transform NODE data into a format suitable for upload to DIMAP dictionaries, (2) DIMAP itself, (3) a set of DIMAP dictionaries (about 20 MB in compressed form) that provide the NODE data in a machine-tractable and viewable format, and (4) a set of DIMAP dictionaries in which the NODE definitions have been parsed and semantic links created between the entries.

Conversion Program
The C++ source code used to format the NODE data for upload to DIMAP dictionaries is provided. While this program is specific to the creation of DIMAP dictionaries, it can easily be modified to remove the output lines while retaining the structural analysis of the NODE data. This program alone justifies the additional licensing fee for NODE+DIMAP.
DIMAP
The NODE+DIMAP distribution includes DIMAP itself, as well as any enhancements made during the licensing period. DIMAP not only enables the examination of entries sense by sense, but also provides a substantial range of functionality to manipulate and examine the dictionary data, including the parsing of definitions. In addition, DIMAP includes the functionality used by CL Research in word-sense disambiguation (Senseval) and question-answering (TREC).
Machine-Tractable DIMAP Versions of NODE
The DIMAP version of NODE provides a graphical interface to the NODE data and separates the data into distinct elements (generally attribute-value pairs). Of particular note, the conversion program creates headword entries for all variants, derivations, and phrasal entries. With the DIMAP functionality for regular-expression searching on any field and for extracting data following a user-specified template, the NODE data become machine-tractable. Although the conversion program is now relatively stable, these DIMAP dictionaries will change as more data are mined during the licensing period.
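
As a rough illustration of the two ideas in this paragraph (promoting variants, derivations, and phrasal entries to headwords of their own, and searching any field with a regular expression), the sketch below uses plain Python dictionaries and the standard `re` module. It is not DIMAP's interface: the function names `expand_headwords` and `search`, the record fields, and the sample entry are all hypothetical, and the sample is not copied from NODE.

```python
import re


def expand_headwords(entry):
    """Return one attribute-value record per headword, variant,
    derivation, and phrasal verb found in a single source entry."""
    source = entry["headword"]
    records = [{"headword": source, "role": "main", "source": source}]
    for role, key in (("variant", "variants"),
                      ("derivation", "derivations"),
                      ("phrasal verb", "phrasal_verbs")):
        for form in entry.get(key, []):
            records.append({"headword": form, "role": role, "source": source})
    return records


def search(records, field_name, pattern):
    """Return the records whose named field matches the regular expression."""
    rx = re.compile(pattern)
    return [r for r in records if rx.search(str(r.get(field_name, "")))]


# Invented sample entry for illustration only.
entry = {
    "headword": "take",
    "variants": [],
    "derivations": ["taker"],
    "phrasal_verbs": ["take off", "take over"],
}

records = expand_headwords(entry)

# Multiword headwords that begin with "take" followed by whitespace.
matches = search(records, "headword", r"^take\s")
print([r["headword"] for r in matches])
```

Each promoted record keeps a `source` field so that a variant or derivation can still be traced back to the entry it came from.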
Parsed DIMAP Versions of NODE with Semantic Links
The definitions of NODE have been parsed to create semantic links, including hypernyms or superordinates, typical subjects and objects of verbs, synonyms, location, goal, purpose, manner, member-of, has-parts, and others. With DIMAP's primitive-finding digraph algorithm, it is possible to analyze the semantic network for parts of speech and subdictionaries (such as thesaurus groupings). Given CL Research's commitment to improving definition parsing and the use of this information in word-sense disambiguation and question-answering, and our relationship with OUP lexicographers, enhancements to the parsed DIMAP NODE dictionaries will be available during the licensing period.
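
The details of DIMAP's primitive-finding algorithm are not given here. As a hedged illustration of the general idea of analyzing a digraph of semantic links, the sketch below treats hypernym links as directed edges and reports the strongly connected components that have no link leaving them, that is, groups of words defined only in terms of each other. This generic approach (Tarjan's algorithm) is an assumption chosen for illustration, not a description of DIMAP, and the sample links are invented rather than taken from NODE.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps a node to a list of successor nodes."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(node):
        index[node] = lowlink[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)
    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return components


def candidate_primitives(graph):
    """Return components from which no hypernym link leads elsewhere."""
    result = []
    for component in strongly_connected_components(graph):
        has_exit = any(succ not in component
                       for node in component
                       for succ in graph.get(node, ()))
        if not has_exit:
            result.append(component)
    return result


# Invented hypernym links (word -> its hypernyms), for illustration only.
hypernyms = {
    "spaniel": ["dog"],
    "dog": ["animal"],
    "animal": ["entity"],
    "entity": ["thing"],
    "thing": ["entity"],
}

print(candidate_primitives(hypernyms))
```

In this toy network, "entity" and "thing" define each other and point nowhere else, so they form the only candidate group. The recursive traversal is adequate for small samples; a dictionary-sized network would call for an iterative version.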


This document is maintained by Kenneth Litkowski (ken@clres.com).
Material Copyright © 2001 CL Research