Contents of Big Mac, the Macquarie Thesaurus and Big Mac+DIMAP
Machine-Readable Version (MRD) of The Macquarie Dictionary
The MRD version of The Macquarie Dictionary comprises 64.6 MB of data, marked up
with tags. The distribution includes a full description of these tags (as well
as their disposition when converted to DIMAP). A headword, flagged as either
standard or encyclopedic, is divided into chunks, one for each part of speech
and subhead (idiomatic phrase). Each chunk contains a
part of speech label, pronunciation, inflections, head modifiers, grammatical
notes, any subheads, variants, subject and other labels, and a definition
field. The definition field contains either a definition proper or a
cross-reference, and optionally may contain other information such as context,
illustrative examples, geographic information (for encyclopedic place entries),
scientific information, abbreviations, etymologies, links to The Macquarie
Thesaurus, and additional labels not appearing in the print dictionary. Many
entries also have undefined runons (derivatives of the head), each of which may
have associated variants, pronunciations, and thesaurus links.
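As a rough illustration of the entry structure just described, the following
Python sketch models a headword, its chunks, and its definition fields. All
class and field names are hypothetical (the actual tag names are documented in
the distribution, not here), only a few of the fields listed above are
modelled, and the sample entry is made-up data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Definition:
    # A definition field holds either a definition proper or a cross-reference.
    text: Optional[str] = None
    cross_reference: Optional[str] = None
    examples: List[str] = field(default_factory=list)
    thesaurus_links: List[str] = field(default_factory=list)


@dataclass
class Chunk:
    # One chunk per part of speech or subhead (idiomatic phrase).
    part_of_speech: str
    definitions: List[Definition] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    encyclopedic: bool = False  # False means a standard entry
    chunks: List[Chunk] = field(default_factory=list)
    runons: List[str] = field(default_factory=list)  # undefined derivatives

    def parts_of_speech(self) -> List[str]:
        """List the part-of-speech label of each chunk, in order."""
        return [chunk.part_of_speech for chunk in self.chunks]


# Made-up sample entry, for illustration only.
entry = Entry(
    headword="run",
    chunks=[
        Chunk("verb", [Definition(text="to move quickly on foot")]),
        Chunk("noun", [Definition(cross_reference="running")]),
    ],
    runons=["runnable"],
)
```

Calling `entry.parts_of_speech()` on the sample entry lists one label per chunk.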
Machine-Readable Version (MRD) of The Macquarie Thesaurus
The Macquarie Thesaurus is a 5.2 MB text file, with a line for each category
name, for each paragraph (grouped by part of speech, usually with several
paragraphs for each part of speech), and for each word in each subparagraph. An
inverted index file of 5.2 MB is structured like a WordNet index file,
identifying all (category, paragraph, subparagraph) triples in which an entry
appears, along with its part of speech. A set of Perl scripts used to generate
the index file is included in the distribution; C++ code making use of a highly
compressed version of this index for lexical chaining is also provided.
(Stephen Green developed the Perl scripts and C++ code and has graciously made
them available.)
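The idea of the inverted index can be sketched in a few lines of Python. This
is not the distributed Perl code and does not reproduce the actual file
layout; the nested `thesaurus` structure, the function name, and the sample
categories and words are assumptions made for illustration. The sketch only
shows how each entry is mapped to the (category, paragraph, subparagraph)
triples in which it appears, together with its part of speech.

```python
from collections import defaultdict

# Hypothetical in-memory thesaurus: category -> list of paragraphs; each
# paragraph is (part_of_speech, list of subparagraphs); each subparagraph
# is a list of words. The content is made up.
thesaurus = {
    "SPEED": [
        ("noun", [["speed", "velocity"], ["haste", "hurry"]]),
        ("verb", [["speed", "hurry"]]),
    ],
    "SLOWNESS": [
        ("noun", [["delay"]]),
    ],
}


def build_inverted_index(thesaurus):
    """Map each word to (category, paragraph, subparagraph, pos) tuples."""
    index = defaultdict(list)
    for category, paragraphs in thesaurus.items():
        for p_num, (pos, subparagraphs) in enumerate(paragraphs, start=1):
            for s_num, words in enumerate(subparagraphs, start=1):
                for word in words:
                    index[word].append((category, p_num, s_num, pos))
    return dict(index)


index = build_inverted_index(thesaurus)
```

Here `index["hurry"]` holds two locations, one in a noun paragraph and one in a
verb paragraph of the same category, which is the kind of lookup lexical
chaining relies on.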
CL Research and DIMAP Enhancements of The Macquarie Dictionary
The CL Research enhancements of Big Mac consist of (1) the conversion
program used to transform Big Mac data into a format suitable for upload to
DIMAP dictionaries, (2) DIMAP itself, and (3) a set of DIMAP dictionaries
(about 140 MB in uncompressed form) that provide the Big Mac data in a
machine-tractable and viewable format, in which the Big Mac definitions have
been parsed and semantic links created between the entries.
- Conversion Program
- The C++ source code of the program that formats Big Mac data for upload to
DIMAP dictionaries is provided. Although this program is specific to the
creation of DIMAP dictionaries, it can easily be modified to drop the DIMAP
output lines while keeping its structural analysis of the Big Mac data.
- DIMAP
- The Big Mac+DIMAP distribution includes DIMAP
itself, as well as any enhancements made during the licensing period. DIMAP not
only enables the examination of entries sense by sense, but also provides a
substantial range of functionality to manipulate and examine the dictionary
data, including the parsing of definitions. In addition, DIMAP includes the
functionality used by CL Research in word-sense disambiguation (Senseval) and
question-answering (TREC).
- Machine-Tractable and DIMAP-Parsed Version of Big Mac
- The DIMAP version of Big Mac provides a graphical interface to the Big Mac data
and separates the data into distinct elements (generally in attribute-value
pairs). Of particular note is that the conversion program creates headword
entries for all variants, derivations, and phrasal entries. With the DIMAP
functionality for regular-expression searching on any field and for extracting
data following a user-specified template, the Big Mac data become
machine-tractable. In addition to the core dictionary, three complementary
DIMAP dictionaries based on the Big Mac data are also provided:
- a "heads" dictionary, with 9,839 entries for the final noun or
final word in multiword or hyphenated Big Mac noun and adjective entries,
- a "base" dictionary with 63,197 entries, showing the base form of
all inflected forms, variants, and derivatives, and
- a "derivatives" dictionary with 33,174 entries, showing all words
derived from base forms, including inflected forms, etymologically derived
forms, and variants.
- Although the conversion program is now relatively stable, these DIMAP
dictionaries will change as more data are mined during the licensing period.
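The "heads" dictionary described above can be approximated with a short Python
sketch. It is a simplification under stated assumptions: it always takes the
final word of a multiword or hyphenated entry (the real dictionary
distinguishes the final noun, which would need part-of-speech information),
the function names are hypothetical, and the sample entries are made up rather
than taken from Big Mac.

```python
import re
from collections import defaultdict


def head_of(entry):
    """Return the final word of a multiword or hyphenated entry, else None."""
    parts = [p for p in re.split(r"[\s-]+", entry.strip()) if p]
    return parts[-1] if len(parts) > 1 else None


def build_heads(entries):
    """Group multiword and hyphenated entries under their final word."""
    heads = defaultdict(list)
    for entry in entries:
        head = head_of(entry)
        if head is not None:  # single-word entries have no separate head
            heads[head].append(entry)
    return dict(heads)


# Made-up sample entries, for illustration only.
entries = ["guard dog", "sheep-dog", "dog", "hot dog", "ice-cream cone"]
heads = build_heads(entries)
```

With this sample, `heads["dog"]` collects the three compound entries ending in
"dog", while the single-word entry "dog" contributes nothing on its own.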
- Parsing the definitions of Big Mac created semantic links, including
hypernyms or superordinates, typical subjects and objects of verbs, synonyms,
location, goal, purpose, manner, member-of, has-parts, and others. With DIMAP's
primitive-finding digraph algorithm, it is possible to analyze the semantic
network for parts of speech and subdictionaries (such as thesaurus groupings).
With CL Research's commitment to improving definition parsing and the use of
this information in word-sense disambiguation and question-answering, and our
relationship with Macquarie lexicographers, enhancements to the parsed DIMAP
Big Mac dictionaries will be available during the licensing period.
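The semantic links described above can be pictured as typed edges in a
directed graph. The Python sketch below is an illustration only: it is not
DIMAP's primitive-finding algorithm, which is not described here, and the
class name, link-type strings, and sample links are all hypothetical. It shows
how links of a given type can be queried and how a hypernym chain can be
followed until it either ends or loops back on itself; such definitional
cycles are one place where an analysis of a dictionary's semantic network
might look for primitives.

```python
from collections import defaultdict


class SemanticNetwork:
    def __init__(self):
        # links[source][link_type] -> list of target entries
        self.links = defaultdict(lambda: defaultdict(list))

    def add_link(self, source, link_type, target):
        self.links[source][link_type].append(target)

    def targets(self, source, link_type):
        """Return targets of one link type without creating empty entries."""
        return list(self.links.get(source, {}).get(link_type, []))

    def hypernym_chain(self, word):
        """Follow the first hypernym link repeatedly.

        Returns (chain, cyclic); cyclic is True if the next hypernym
        was already on the chain.
        """
        chain, seen = [word], {word}
        while True:
            nxt = self.targets(chain[-1], "hypernym")
            if not nxt:
                return chain, False
            if nxt[0] in seen:
                return chain, True
            chain.append(nxt[0])
            seen.add(nxt[0])


# Made-up links, for illustration only.
net = SemanticNetwork()
net.add_link("spaniel", "hypernym", "dog")
net.add_link("dog", "hypernym", "animal")
net.add_link("animal", "hypernym", "thing")
net.add_link("thing", "hypernym", "object")
net.add_link("object", "hypernym", "thing")  # definitional loop
net.add_link("eat", "typical-object", "food")

chain, cyclic = net.hypernym_chain("spaniel")
```

Restricting the links to one part of speech or to the members of a thesaurus
grouping before building the network would give the kind of subdictionary
analysis mentioned above.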