Contents of NODE and NODE+DIMAP
Machine-Readable Version (MRD) of the New Oxford Dictionary of English (NODE)
The MRD version of NODE comprises 46.6 MB of data (see sample), marked up with tags. The
distribution includes a full description of these tags. An entry, marked as
either standard or encyclopedic, is divided into five main blocks: headword,
part-of-speech/definitions, phrases, derivations, and phrasal verbs.
- Headword Block
- Contains the headword (which may be a multiword unit or an encyclopedic
name), and may contain a homograph identifier, subject and/or specialist
labels, pronunciations, variants, register, and geographic label.
- Part of Speech/Definition Block
- Contains a part of speech and a variable number of definitions associated
with each part of speech, organized into core senses, subsenses, and technical
or additional information. Includes inflections and variants appropriate to
the part of speech or the sense or subsense. May include grammatical
characterizations (see full list) for nouns (mass, count, attributive), verbs (with or without
object, with adverbial), adjectives (attributive, predicative, postpositive),
and adverbs (sentence, submodifier), as well as the (inflected) form in which
a specific sense is used. May also include register, subject and/or specialist
labels, and geographic labels specific to the sense or subsense. Includes
illustrative examples for each sense or subsense, taken from the British
National Corpus or the Oxford Reading Programme (many of which also identify
collocations in which the headword is used).
- Phrase Block
- Contains idioms and phrases in which the headword commonly appears, often
including phrasal variations. These phrases are fully defined,
following the format of the definition block.
- Derivation Block
- Contains undefined derivations of the headword, along with their parts of
speech.
- Phrasal Verb Block
- Contains verb phrases that begin with a verb headword. These phrases are
fully defined, following the format of the definition block.
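The five-block entry layout described above can be pictured as a nested record. The following is only a minimal sketch under stated assumptions: the class and field names (Entry, PartOfSpeechBlock, Sense, and so on) are hypothetical and are not the NODE tag names, which are documented in the distribution. The sample entry is invented placeholder text, not NODE data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    """A core sense; subsenses nest underneath it (hypothetical layout)."""
    definition: str
    labels: List[str] = field(default_factory=list)    # register, subject, geographic
    examples: List[str] = field(default_factory=list)  # illustrative examples
    subsenses: List["Sense"] = field(default_factory=list)


@dataclass
class PartOfSpeechBlock:
    part_of_speech: str
    grammar: List[str] = field(default_factory=list)   # e.g. "mass", "count"
    senses: List[Sense] = field(default_factory=list)


@dataclass
class DefinedPhrase:
    """A phrase or phrasal verb, defined in the same way as a headword sense."""
    text: str
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Derivation:
    """An undefined derivation: only the form and its part of speech."""
    form: str
    part_of_speech: str


@dataclass
class Entry:
    headword: str
    entry_type: str = "standard"                        # or "encyclopedic"
    homograph: Optional[int] = None
    pos_blocks: List[PartOfSpeechBlock] = field(default_factory=list)
    phrases: List[DefinedPhrase] = field(default_factory=list)
    derivations: List[Derivation] = field(default_factory=list)
    phrasal_verbs: List[DefinedPhrase] = field(default_factory=list)


def count_senses(entry: Entry) -> int:
    """Count core senses and subsenses across all part-of-speech blocks."""
    def walk(senses: List[Sense]) -> int:
        return sum(1 + walk(s.subsenses) for s in senses)
    return sum(walk(block.senses) for block in entry.pos_blocks)


def example_entry() -> Entry:
    """Build an invented placeholder entry (not taken from NODE)."""
    core = Sense("placeholder core sense",
                 examples=["placeholder example"],
                 subsenses=[Sense("placeholder subsense")])
    noun = PartOfSpeechBlock("noun", grammar=["count"], senses=[core])
    return Entry(headword="widget",
                 pos_blocks=[noun],
                 derivations=[Derivation("widgety", "adjective")])
```

Keeping phrases and phrasal verbs in the same shape as ordinary senses mirrors the statement above that they follow the format of the definition block, while derivations stay deliberately thin because they are undefined.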
CL Research and DIMAP Enhancements of NODE
The CL Research enhancements of NODE consist of (1) the conversion program
used to transform NODE data into a format suitable for upload to DIMAP
dictionaries, (2) DIMAP itself, (3) a set of DIMAP dictionaries (about 20 MB in
compressed form) that provide the NODE data in a machine-tractable and viewable
format, and (4) a set of DIMAP dictionaries in which the NODE definitions have
been parsed and semantic links created between the entries.
- Conversion Program
- The source code (in C++) by which NODE data were formatted for upload to
DIMAP dictionaries is provided. While this program is specific to the creation
of DIMAP dictionaries, it can be modified to drop the DIMAP output lines
while keeping the structural analysis of the NODE data. This program alone
justifies the additional licensing fee for NODE+DIMAP.
- DIMAP
- The NODE+DIMAP distribution includes DIMAP itself, as well as any
enhancements made during the licensing period. DIMAP not only enables the
examination of entries sense by sense, but also provides a substantial range
of functionality to manipulate and examine the dictionary data, including the
parsing of definitions. In addition, DIMAP includes the functionality used by
CL Research in word-sense disambiguation (Senseval) and question-answering
(TREC).
- Machine-Tractable DIMAP Versions of NODE
- The DIMAP version of NODE provides a graphical interface to the NODE data and
separates the data into distinct elements (generally in attribute-value
pairs). Of particular note is that the conversion program creates headword
entries for all variants, derivations, and phrasal entries. With the DIMAP
functionality for regular-expression searching on any field and for extracting
data following a user-specified template, the NODE data become
machine-tractable. Although the conversion program is now relatively stable,
changes will occur in these DIMAP dictionaries as more data are mined
during the licensing period.
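The two ideas in this item, promoting variants, derivations, and phrasal entries to headword entries and then searching any field by regular expression, can be illustrated roughly as follows. This is a sketch only: the record layout and the functions expand and search are hypothetical, they are not DIMAP's interface or the C++ conversion program, and the sample forms are invented for illustration.

```python
import re


def expand(entry):
    """Yield the main record plus one headword record per variant,
    derivation, and phrasal verb (attribute-value pairs as a dict)."""
    source = entry["headword"]
    yield {"headword": source, "kind": "main", "source": source}
    for kind in ("variants", "derivations", "phrasal_verbs"):
        for form in entry.get(kind, []):
            # kind[:-1] drops the plural "s": "variants" -> "variant"
            yield {"headword": form, "kind": kind[:-1], "source": source}


def search(records, field_name, pattern):
    """Return the records whose given field matches the regular expression."""
    rx = re.compile(pattern)
    return [r for r in records if rx.search(str(r.get(field_name, "")))]


# Invented sample entry; the forms are placeholders, not NODE data.
sample = {
    "headword": "take",
    "derivations": ["taker"],
    "phrasal_verbs": ["take off", "take over"],
}
records = list(expand(sample))
```

With these records, search(records, "headword", r"^take\s") selects the two phrasal entries, and search(records, "kind", "^derivation$") selects the derivation. Each record keeps a "source" field so that a promoted entry can still be traced back to the entry it came from.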
- Parsed DIMAP Versions of NODE with Semantic Links
- The definitions of NODE have been parsed to create semantic links, including
hypernyms or superordinates, typical subjects and objects of verbs, synonyms,
location, goal, purpose, manner, member-of, has-parts, and others. With
DIMAP's primitive-finding digraph algorithm, it is possible to analyze the
semantic network for parts of speech and subdictionaries (such as thesaurus
groupings). CL Research is committed to improving definition parsing and
the use of this information in word-sense disambiguation and
question-answering, and maintains a relationship with OUP lexicographers. As a
result, enhancements to the parsed DIMAP NODE dictionaries will be available
during the licensing period.
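Typed semantic links of the kind listed above form a labeled directed graph. The sketch below stores such links and follows hypernym links upward from a word. It is an assumption-laden illustration: the LinkGraph class and its methods are hypothetical, the sample links are invented, and the traversal is a simple chain walk, not DIMAP's primitive-finding digraph algorithm, whose details are not given here.

```python
from collections import defaultdict


class LinkGraph:
    """Directed graph whose edges carry a link type such as "hypernym"."""

    def __init__(self):
        self._out = defaultdict(list)   # source -> [(link_type, target), ...]

    def add(self, source, link_type, target):
        self._out[source].append((link_type, target))

    def targets(self, source, link_type):
        """All targets reachable from source by one link of the given type."""
        return [t for (lt, t) in self._out.get(source, []) if lt == link_type]

    def hypernym_chain(self, word):
        """Follow the first hypernym link repeatedly, stopping at a word
        with no hypernym or when the next word would repeat (a cycle)."""
        chain, seen = [word], {word}
        while True:
            ups = self.targets(chain[-1], "hypernym")
            if not ups or ups[0] in seen:
                return chain
            chain.append(ups[0])
            seen.add(ups[0])


# Invented sample links, for illustration only.
graph = LinkGraph()
graph.add("spaniel", "hypernym", "dog")
graph.add("dog", "hypernym", "animal")
graph.add("dog", "has-parts", "tail")
```

Here graph.hypernym_chain("spaniel") walks from "spaniel" through "dog" to "animal". The cycle check matters because dictionary definitions can define words in terms of each other, which is one reason a digraph analysis of the definition network is of interest.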
This document is maintained by Kenneth Litkowski (ken@clres.com).
Material Copyright © 2001 CL Research