Macquarie-CL Research Collaboration

Contents of Big Mac, the Macquarie Thesaurus and Big Mac+DIMAP

Machine-Readable Version (MRD) of The Macquarie Dictionary

The MRD version of The Macquarie Dictionary comprises 64.6 MB of data (see sample), marked up with tags. The distribution includes a full description of these tags, together with their disposition when converted to DIMAP. Each headword entry, flagged as either standard or encyclopedic, is divided into chunks, one for each part of speech and each subhead (idiomatic phrase). Each chunk contains a part of speech label, pronunciation, inflections, head modifiers, grammatical notes, any subheads, variants, subject and other labels, and a definition field. The definition field contains either a definition proper or a cross-reference, and may optionally contain other information such as context, illustrative examples, geographic information (for encyclopedic place entries), scientific information, abbreviations, etymologies, links to The Macquarie Thesaurus, and additional labels not appearing in the print dictionary. Many entries also have undefined runons (derivatives of the head), each of which may have associated variants, pronunciations, and thesaurus links.
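The entry layout described above can be pictured as a small set of nested records. The following Python sketch is illustrative only: the class and field names (Entry, Chunk, Definition, Runon and their attributes) are assumptions chosen for readability, not the tag names used in the distribution, and the sample values are invented rather than taken from Big Mac.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Definition:
    # Hypothetical model of the definition field: it holds either a
    # definition proper or a cross-reference, never both.
    text: Optional[str] = None
    cross_reference: Optional[str] = None
    examples: List[str] = field(default_factory=list)
    etymology: Optional[str] = None
    thesaurus_links: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if (self.text is None) == (self.cross_reference is None):
            raise ValueError("need exactly one of text or cross_reference")


@dataclass
class Chunk:
    # One chunk per part of speech or subhead (idiomatic phrase).
    part_of_speech: str
    definition: Definition
    pronunciation: Optional[str] = None
    inflections: List[str] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)


@dataclass
class Runon:
    # Undefined derivative of the head, with optional extras.
    form: str
    variants: List[str] = field(default_factory=list)
    thesaurus_links: List[str] = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    encyclopedic: bool = False  # False means a standard entry
    chunks: List[Chunk] = field(default_factory=list)
    runons: List[Runon] = field(default_factory=list)


# Invented toy entry, not Macquarie data.
entry = Entry(
    headword="example",
    chunks=[Chunk("noun", Definition(text="an invented gloss"))],
    runons=[Runon("exampled")],
)
```

Modelling the definition field as "exactly one of two alternatives" mirrors the statement that it contains either a definition proper or a cross-reference; the remaining optional fields stand in for the longer list of optional information given above.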

Machine-Readable Version (MRD) of The Macquarie Thesaurus

The Macquarie Thesaurus is a 5.2 MB text file, with a line for each category name, for each paragraph (grouped by part of speech, usually with several paragraphs for each part of speech), and for each word in each subparagraph. An inverted index file of 5.2 MB is structured like a WordNet index file, identifying all (category, paragraph, subparagraph) triples in which an entry appears, along with its part of speech. A set of Perl scripts used to generate the index file is included in the distribution; C++ code making use of a highly compressed version of this index for lexical chaining is also provided. (Stephen Green developed the Perl scripts and C++ code and has graciously made them available.)
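The idea behind the inverted index can be sketched in a few lines of Python. This is not the distributed Perl code, and the on-disk formats of the thesaurus and index files are not reproduced here: the record layout, the function names build_inverted_index and format_index_line, the output line format, and the sample words and numbers are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (category, paragraph, subparagraph) triple locating a word.
Location = Tuple[int, int, int]
# (word, part of speech, location) as read from the thesaurus.
Record = Tuple[str, str, Location]
Postings = List[Tuple[str, Location]]


def build_inverted_index(records: Iterable[Record]) -> Dict[str, Postings]:
    """Map each word to every (part of speech, triple) where it appears."""
    index: Dict[str, Postings] = defaultdict(list)
    for word, pos, location in records:
        index[word].append((pos, location))
    for postings in index.values():
        postings.sort()  # stable, predictable order within each entry
    return dict(index)


def format_index_line(word: str, postings: Postings) -> str:
    """Render one entry as a single text line (assumed format)."""
    parts = [f"{pos}:{c}.{p}.{s}" for pos, (c, p, s) in postings]
    return word + " " + " ".join(parts)


# Invented records, not taken from The Macquarie Thesaurus.
records = [
    ("light", "n", (1, 1, 2)),
    ("lamp", "n", (1, 1, 2)),
    ("light", "adj", (2, 1, 1)),
]
index = build_inverted_index(records)
line = format_index_line("light", index["light"])
```

A lookup for a word then returns every place it occurs in the thesaurus, which is the kind of access that applications such as lexical chaining rely on.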

CL Research and DIMAP Enhancements of The Macquarie Dictionary

The CL Research enhancements of Big Mac consist of (1) the conversion program used to transform Big Mac data into a format suitable for upload to DIMAP dictionaries, (2) DIMAP itself, and (3) a set of DIMAP dictionaries (about 140 MB in uncompressed form) that provide the Big Mac data in a machine-tractable and viewable format, in which the Big Mac definitions have been parsed and semantic links created between the entries.

Conversion Program
The source code (in C++) by which Big Mac data were formatted for upload to DIMAP dictionaries is provided. While this program is specific to the creation of DIMAP dictionaries, it can easily be modified to remove the output lines while retaining the structural analysis of the Big Mac data.
DIMAP
The Big Mac+DIMAP distribution includes DIMAP itself, as well as any enhancements made during the licensing period. DIMAP not only enables the examination of entries sense by sense, but also provides a substantial range of functionality to manipulate and examine the dictionary data, including the parsing of definitions. In addition, DIMAP includes the functionality used by CL Research in word-sense disambiguation (Senseval) and question-answering (TREC).
Machine-Tractable and DIMAP-Parsed Version of Big Mac
The DIMAP version of Big Mac provides a graphical interface to the Big Mac data and separates the data into distinct elements (generally in attribute-value pairs). Of particular note is that the conversion program creates headword entries for all variants, derivations, and phrasal entries. With the DIMAP functionality for regular-expression searching on any field and for extracting data following a user-specified template, the Big Mac data become machine-tractable. In addition to the core dictionary, three complementary DIMAP dictionaries based on the Big Mac data are also provided.
Although the conversion program is now relatively stable, these DIMAP dictionaries will change as more data are mined during the licensing period.
Parsing the definitions of Big Mac created semantic links, including hypernyms or superordinates, typical subjects and objects of verbs, synonyms, location, goal, purpose, manner, member-of, has-parts, and others. With DIMAP's primitive-finding digraph algorithm, it is possible to analyze the semantic network for parts of speech and subdictionaries (such as thesaurus groupings). With CL Research's commitment to improving definition parsing and the use of this information in word-sense disambiguation and question-answering, and our relationship with Macquarie lexicographers, enhancements to the parsed DIMAP Big Mac dictionaries will be available during the licensing period.
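To give a feel for what analyzing such a semantic network involves, the sketch below treats hypernym links as a directed graph and looks for candidate primitives. It is not DIMAP's primitive-finding algorithm, whose details are not given here. It uses one simple assumption: a word is a candidate primitive if every word reachable from it through hypernym links leads back to it, or if it has no hypernym links at all. The function names and the toy graph are invented for illustration.

```python
from typing import Dict, List, Set

# word -> hypernyms found in its parsed definition (assumed shape)
Graph = Dict[str, List[str]]


def reachable(graph: Graph, start: str) -> Set[str]:
    """Return every word reachable from start by following links."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def candidate_primitives(graph: Graph) -> Set[str]:
    """Words whose hypernym chains never leave their own cycle."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    reach = {node: reachable(graph, node) for node in nodes}
    result: Set[str] = set()
    for node in nodes:
        # Keep the node if everything it leads to leads back to it;
        # a node with no outgoing links passes trivially.
        if all(node in reach[other] for other in reach[node]):
            result.add(node)
    return result


# Invented toy network, not taken from the parsed Big Mac data.
links = {
    "spaniel": ["dog"],
    "dog": ["animal"],
    "animal": ["organism"],
    "organism": ["being"],
    "being": ["organism"],
}
primitives = candidate_primitives(links)
```

In this toy network the chain from "spaniel" ends in a two-word definitional circle, and those two words are the ones reported. The pairwise reachability check is quadratic and suits only small examples; a dictionary-sized network would call for a linear-time strongly connected components method.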


Maintained by Ken Litkowski.
Copyright © 2002 CL Research