DIMAP Alphabetic WordNet Dictionary

CL Research has created an alphabetic version of WordNet (for versions 1.5 through 3.0) for those interested in creating their own customized dictionaries using the WordNet distribution. For WordNet 3.0, the dictionary contains 146,973 entries (58 MB) and the associated heads dictionary contains 33,967 entries (4.4 MB). Use of these dictionaries requires CL Research's Dictionary Maintenance Programs (DIMAP). A demonstration version of DIMAP is available at CL Research. The Cognitive Science Laboratory site at Princeton University provides more details on the WordNet project.

If you have not downloaded the latest DIMAP Alphabetic WordNet, follow this link.

These dictionaries can be opened directly in DIMAP (use File|Open Dictionary and find the directory where you have unzipped wn.dct or heads.dct). Optionally, either dictionary can be established as an auxiliary dictionary under File|Preferences and individual words can be accessed with Ctrl-A or Ctrl-W.

Since these dictionaries are in DIMAP format, they may be modified at will and converted wholly or partially into subsets of the WordNet data. A Document Type Definition (DTD) and an XML Schema have been prepared for the conversion of the complete DIMAP versions of WordNet into XML, using the template file DMP2XML.TMP (available in the DIMAP demo files). This template can be customized to a user's needs (see Instructions for Using a Template in DIMAP on how to accomplish this).


Conversion to DIMAP

The creation of the DIMAP alphabetic WordNet dictionaries was performed within DIMAP, using a batch download process to interpret the WordNet files. A Perl script was used to create a list of words and phrases for each letter of the alphabet from the WordNet sense index file (sense.idx or index.sense). Each list was then used to create the corresponding DIMAP dictionary entry for each word or phrase in WordNet; the creation of the full dictionary takes less than an hour. See Principles for Converting WordNet Entries to DIMAP Format for details on the conversion. A heads dictionary has also been created from WordNet multiword units based on the first and last words of these entries. See Creation of WordNet Head Dictionary for details of this creation. See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.
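
The per-letter word lists can be pictured with a short sketch. The original step was done with a Perl script that is not reproduced here; the Python below is only an illustration under the assumption that each line of the sense index starts with a sense key of the form lemma%lex_sense, followed by whitespace-separated fields. The function name words_by_letter and the sample lines (including their offsets) are made up for the example.

```python
from collections import defaultdict

def words_by_letter(sense_index_lines):
    """Group the distinct lemmas of a sense index by their first character."""
    lists = defaultdict(set)
    for line in sense_index_lines:
        line = line.strip()
        if not line:
            continue
        sense_key = line.split()[0]
        # Everything before the last '%' is the lemma (underscores still present).
        lemma = sense_key.rpartition("%")[0]
        lists[lemma[0].lower()].add(lemma)
    return {letter: sorted(lemmas) for letter, lemmas in lists.items()}

# Made-up lines in the sense index layout; the offsets are not real.
sample = [
    "abide_by%2:41:00:: 00000001 1 0",
    "about-face%1:04:00:: 00000002 1 0",
    "about-face%2:38:00:: 00000003 1 0",
    "stone%1:17:00:: 00000004 1 0",
]
lists = words_by_letter(sample)
print(lists["a"])
```

A lemma that occurs under several parts of speech appears only once in its list, since one DIMAP entry holds all of its senses.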


Principles for Converting WordNet Entries to DIMAP Format

Conversion begins by creating a list of all the synsets associated with a word or phrase. Each WordNet synset is identified from the WordNet index files (index.* or *.idx) for use in creating a DIMAP sense in an entry for the word or phrase that is being processed.

Processing the WordNet data records (data.* or *.data) converts almost all information from the synset, pointers, verb frames, and gloss into DIMAP counterparts. Any phrases in a WordNet synset are stripped of underscores. For pointers, this includes processing of the pointer symbol, the synsets and files to which each symbol is linked, and the from/to field for the pointer. Processing of frame information for verbs includes the frame numbers and the synset members to which they apply. All of the gloss is extracted from the WordNet record and processed.
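
As a rough illustration of what this processing has to pull out of a data record, the sketch below splits one record into synset members, pointers, verb frames, and gloss. It assumes the field layout given in the WordNet database documentation (a hexadecimal word count, a decimal pointer count, four fields per pointer, and "+ frame word" triples for verbs). It is not the DIMAP batch process itself, and the function name and the sample record are invented for the example.

```python
def parse_data_record(line):
    """Split one WordNet data record into its main parts."""
    fields, _, gloss = line.partition("|")
    tok = fields.split()
    record = {
        "offset": tok[0],
        "lex_filenum": int(tok[1]),
        "ss_type": tok[2],
        "gloss": gloss.strip(),
        "words": [],
        "pointers": [],
        "frames": [],
    }
    w_cnt = int(tok[3], 16)            # word count is hexadecimal
    pos = 4
    for _ in range(w_cnt):
        # Phrases are stored with underscores; strip them as the text describes.
        record["words"].append(tok[pos].replace("_", " "))
        pos += 2                       # skip the lex_id that follows each word
    p_cnt = int(tok[pos])              # pointer count is decimal
    pos += 1
    for _ in range(p_cnt):
        symbol, target, target_pos, src_tgt = tok[pos:pos + 4]
        record["pointers"].append({
            "symbol": symbol,
            "offset": target,
            "pos": target_pos,
            "source": int(src_tgt[:2], 16),   # "from" word number, 0 = whole synset
            "target": int(src_tgt[2:], 16),   # "to" word number, 0 = whole synset
        })
        pos += 4
    if record["ss_type"] == "v" and pos < len(tok):
        f_cnt = int(tok[pos])
        pos += 1
        for _ in range(f_cnt):
            # Each frame is written as: + frame_number word_number
            record["frames"].append(
                {"frame": int(tok[pos + 1]), "word": int(tok[pos + 2], 16)}
            )
            pos += 3
    return record

# Invented record in the data file layout; offsets are not real.
sample = ('00000010 41 v 02 abide_by 0 honor 0 001 @ 00000020 v 0000 '
          '01 + 08 00 | act in accordance with; "abide by the rules"')
record = parse_data_record(sample)
print(record["words"], record["frames"])
```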

See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.


Creation of a WordNet Entry

Conversion of WordNet data into a DIMAP entry begins with the creation of DIMAP header information. Each word or phrase (with underscores removed) becomes the basis for a DIMAP entry (that may have multiple senses). The word or phrase becomes the DIMAP entry; it is treated as a regular entry, given a definition type 'r'. Each entry is given a unique identifier (a code number) used in the DIMAP btree lookup system; this identifier consists of the first letter of the entry and a five-digit number. After creating the header, each WordNet data record is converted into a DIMAP sense. (A sense is created for each record that has the same lowercased word form.)

The category (part of speech) of the DIMAP sense is based on the WordNet data file for the synset ("nou" for synsets from the WordNet noun file, "vrb" for verbs, "adj" for adjectives, and "adv" for adverbs). Each DIMAP sense in a part of speech is given a unique identifier in a DIMAP feature with the name id and a value equal to the WordNet file number multiplied by 100, plus the WordNet sense number. (This id number is a unique identifier and corresponds to a unique sense in WordNet.) After assigning a category, special consideration is given to WordNet adjectives, which are subdivided into descriptive (adjectives that ascribe a value of an attribute to a noun) and relational (adjectives that do not relate to an attribute) subfiles. This information is embodied in a feature in the DIMAP sense; based on the WordNet file, an adjective is given either a descriptive or a relational feature attribute and a value for this attribute of +.
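
The identifiers described above amount to simple arithmetic and formatting. The sketch below is illustrative only: it assumes that the five-digit part of the code number comes from a running counter (the text does not say how the number is assigned) and that "file number" means the lexicographer file number of the synset. The function names are hypothetical.

```python
# Category labels keyed by the WordNet data file, as listed in the text.
CATEGORY = {"noun": "nou", "verb": "vrb", "adj": "adj", "adv": "adv"}

def code_number(entry, serial):
    """First letter of the entry plus a five-digit number (serial is assumed)."""
    return f"{entry[0]}{serial:05d}"

def sense_id(file_number, sense_number):
    """WordNet file number multiplied by 100, plus the WordNet sense number."""
    return file_number * 100 + sense_number

print(code_number("stone", 42))   # s00042
print(sense_id(5, 2))             # 502
```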

Most of the WordNet data consists of relations with other words; this information is entered in DIMAP superconcepts, instances, and roles. The four principal types of information in a WordNet data record are:
  • Synset information
  • Hypernym, hyponym, and other semantic relations
  • Verb frame identification
  • Gloss (including usage label, usage note, definition, and examples)

    See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.


    Conversion of WordNet Synset Information

    The principal relation in WordNet is the synonym relation, captured in the very meaning of a synset (synonym set).  In DIMAP, the role syn is established with links to all members of the entry's WordNet synset. (Note that the only trace of the original WordNet synset in DIMAP is the combination of the entry token and the tokens in the links for the syn role.) Part of the DIMAP data structure for roles includes an identification of specific senses to which a link has been made; the sense number that appears is the id number for the appropriate synset. 

    For adjectives in the WordNet files, a synset member may be annotated with a syntactic marker indicating a limitation on the syntactic position the adjective may have in relation to a noun that it modifies. This syntactic marker appears in the WordNet files as a parenthesized string concatenated directly to the synset member and is removed in establishing the link entry. If such a syntactic marker is attached to the synset member for which the DIMAP entry is being created, a DIMAP feature is added for this sense, with the feature attribute "predicative" for 'p', "attributive" for 'a', and "postnominal" for 'ip', and the feature value +.
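
A minimal sketch of the marker handling, assuming the marker is always the final parenthesized string on the synset member. The function name split_marker is hypothetical and the sample strings are for illustration.

```python
import re

# Feature attribute for each syntactic marker, as given in the text.
MARKER_FEATURES = {"p": "predicative", "a": "attributive", "ip": "postnominal"}
_MARKER = re.compile(r"\((p|a|ip)\)$")

def split_marker(member):
    """Return (member without marker, feature attribute or None)."""
    m = _MARKER.search(member)
    if not m:
        return member, None
    return member[:m.start()], MARKER_FEATURES[m.group(1)]

print(split_marker("galore(ip)"))
print(split_marker("stone"))
```

The stripped form is what would be used for the link, and the feature attribute (with value +) is added only when the marked member is the entry being created.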

    See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.


    Conversion of WordNet Pointer Information

    Each WordNet pointer associated with the current WordNet synset gives rise to a DIMAP role (used to encode semantic relations), with DIMAP links to the members of the synset to which the WordNet pointer is linked.  (Note that some pointers associated with a synset in the WordNet data are limited to only one member of the current synset, as indicated in the from/to field of the WordNet data; if the semantic relation identified by a pointer does not apply, as indicated by the from field, to the synset entry for which a DIMAP entry is being created, no DIMAP role is created.  Similarly, if the to field indicates that the relation does not apply to the synset member being added as a link, the link is not added to the role-link set that is being created.)  WordNet hypernym and hyponym pointers are treated differently, since DIMAP uses the specific superconcept and instance fields for these relations.  In these cases, the WordNet hypernym and hyponym pointers are converted to DIMAP superconcept and instance links.
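
The from/to filtering can be sketched as follows. The sketch assumes the layout described in the WordNet database documentation: a four-digit hexadecimal field whose first two digits give the source word number and whose last two give the target word number, with 00 meaning the relation holds for the whole synset. Word numbers count from 1. The function names are hypothetical.

```python
def parse_source_target(field):
    """Split a four-hex-digit from/to field into (source, target) word numbers."""
    return int(field[:2], 16), int(field[2:], 16)

def links_for_entry(field, entry_word_num, target_words):
    """Return the words to link for this entry, or [] if no role applies."""
    source, target = parse_source_target(field)
    if source not in (0, entry_word_num):
        return []   # the relation belongs to another member of the synset
    return [word for num, word in enumerate(target_words, start=1)
            if target in (0, num)]

print(links_for_entry("0000", 2, ["wet", "damp"]))   # whole-synset relation
print(links_for_entry("0102", 1, ["wet", "damp"]))   # word 1 -> word 2 only
print(links_for_entry("0102", 2, ["wet", "damp"]))   # does not apply to word 2
```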

    For the other types of pointers, a specific DIMAP role is created; the type of role may be specific to the WordNet part of speech:
  • Antonyms (all parts of speech): "ant"
  • Member meronym (nouns): "mem-of"
  • Substance meronym (nouns): "subst-of"
  • Part meronym (nouns): "part-of"
  • Member holonym (nouns): "has-mem"
  • Substance holonym (nouns): "has-subst"
  • Part holonym (nouns): "has-part"
  • Attribute (nouns): "has-attr"
  • Entailment (verbs): "ent"
  • Cause (verbs): "causes"
  • Also see (adjectives, verbs): "rel"
  • Similar to (head cluster) (adjectives):  "sats"
  • Similar to (satellite cluster) (adjectives):  "sat-of"
  • Pertainym (adjectives): "pert-to"
  • Attribute (adjectives): "attr-of"
  • Derived from adjective (adverbs): "der-from"
  • Domain of synset (CATEGORY) (all): "dom-cat"
  • Domain of synset (REGION) (all): "dom-reg"
  • Domain of synset (USAGE) (all): "dom-usage"
  • Participle of verb (adjectives): "part-of-verb"
  • Verb group (verbs): "verb-group"
  • Derivationally related forms (verbs, nouns): "der-form"
  • Domain member (CATEGORY) (nouns): "dom-mem-cat"
  • Domain member (REGION) (nouns): "dom-mem-reg"
  • Domain member (USAGE) (nouns): "dom-mem-usage"
  • Instance of (nouns): "instance-of"
  • Has instance (nouns): "has-instance"

    If a WordNet data record has more than one case of the same pointer type, the links are combined into one DIMAP role.
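
Combining repeated pointer types into one role could look like the sketch below. It keys a small excerpt of the table above by pointer name rather than by WordNet pointer symbol, and it takes already-resolved (pointer type, linked word) pairs as input; both choices, and the function name build_roles, are assumptions made for the illustration.

```python
from collections import defaultdict

# Excerpt of the pointer-to-role table above, keyed by pointer name.
ROLE_NAMES = {
    "Antonym": "ant",
    "Part meronym": "part-of",
    "Part holonym": "has-part",
    "Entailment": "ent",
    "Cause": "causes",
}

def build_roles(pointers):
    """Collect (pointer_type, linked_word) pairs into one link list per role."""
    roles = defaultdict(list)
    for pointer_type, word in pointers:
        roles[ROLE_NAMES[pointer_type]].append(word)
    return dict(roles)

roles = build_roles([
    ("Part holonym", "wheel"),
    ("Antonym", "truck"),
    ("Part holonym", "engine"),
])
print(roles)
```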

    See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.



    Conversion of WordNet Verb Frame Information

    If the WordNet synset is for a verb, it may have an associated set of verb frames in which it can be used.  These frames are identified by creating various DIMAP features, depending on the frames.  Most verb frames give rise to a type feature name, usually with a value equal to vt (transitive), vi (intransitive), ditr (ditransitive), or cop (copular).  All verb frames give rise to a TSubj (typical subject) feature with value equal to somebody, something, it, or body part.  Many verb frames lead to a TObj (typical object) feature with value equal to somebody or something.  Verbs with an object also have a with feature whose value is obj; other frames have with values of infinitive, complement, two objs (two objects), presp (present participles), adverbial, and clause.  Verb frames with particular kinds of adjuncts have a clue feature (corresponding to typical collocation patterns), with values beginning with a tilde ('~') intended to denote the target word, followed by "adj", "NP to NP", "NP from NP", "NP with NP", "NP of NP", "NP on NP", "NP to inf" (infinitive), "to inf", "whether inf", "to NP", "on NP", and "NP into".  One verb frame has the pattern "possNP NP ~" (a possessive noun phrase, followed by another noun phrase, followed by the target word).

    See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.


    Conversion of WordNet Gloss Information

    If a WordNet sense has a gloss, it is analyzed into component parts for inclusion in the DIMAP sense, specifically, into a status or usage label, a definition, a usage note, and illustrative uses (coded in the feature field) of a DIMAP sense.  A usage label is indicated by a left parenthesis in the first position of a WordNet gloss; whatever is included in the part of the gloss beginning with this parenthesis up to a right parenthesis is included in the usage label field of a DIMAP sense.  If the gloss (after any initial usage label) begins with the word "used", this is construed as a usage note (describing how the word is used, rather than defining it) and any text up to a terminating colon or semicolon is placed in the usage note field of a DIMAP sense.  If this text contains single-quoted material, usually identifying a particle or preposition following the word, a DIMAP feature named with is created, with its value equal to what is inside the single quotes.  Any quoted strings in the gloss are interpreted as possible illustrations of the use of the current entry.  Each opening quotation mark is paired with a closing quotation mark; then, a search is made to determine whether the token for the current entry occurs exactly as a substring of the quotation.  If so, a feature attribute-value pair is created for the current DIMAP sense, with the feature attribute equal to the string "ex" and the quoted string as the feature value.  Any remaining text in the WordNet gloss is treated as the definition of the entry in this sense and is put into the definition field of the current DIMAP sense.  Since the WordNet glosses are intended only as reminders or notes for the lexicographers, many of them do not quite follow the patterns used for analysis, leading to poor DIMAP conversions.
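
The gloss analysis can be approximated in a few lines. This is a sketch of the steps as described, not the DIMAP code. It assumes that every double-quoted string is removed from the definition, whether or not it qualifies as an example, and that leftover separators are trimmed. The function name, the dictionary keys, and the sample gloss are made up for the illustration.

```python
import re

def analyze_gloss(gloss, entry):
    """Split a gloss into usage label, usage note, 'with' values, examples, definition."""
    sense = {"label": None, "note": None, "with": [], "ex": [], "definition": ""}
    text = gloss.strip()
    # Usage label: a parenthesis in the first position, up to the right parenthesis.
    if text.startswith("("):
        end = text.find(")")
        if end != -1:
            sense["label"] = text[1:end]
            text = text[end + 1:].lstrip()
    # Usage note: text beginning with "used", up to a colon or semicolon.
    if text.startswith("used"):
        m = re.search(r"[:;]", text)
        note, text = (text[:m.start()], text[m.end():]) if m else (text, "")
        sense["note"] = note.strip()
        sense["with"] = re.findall(r"'([^']+)'", note)
    # Examples: double-quoted strings that contain the entry token exactly.
    for quote in re.findall(r'"([^"]*)"', text):
        if entry in quote:
            sense["ex"].append(quote)
    # Definition: whatever remains once the quoted strings are removed.
    text = re.sub(r'"[^"]*"', "", text)
    sense["definition"] = text.strip(" ;")
    return sense

# Made-up gloss that follows the expected pattern.
sample = "(informal) used with 'to'; very close; " + '"he stuck close to her"'
result = analyze_gloss(sample, "close")
print(result)
```

Glosses that depart from the pattern (an apostrophe inside a usage note, for instance) would be split poorly by this sketch as well, which matches the caveat at the end of the paragraph above.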

    See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.


    Creation of WordNet Head Dictionary

    A heads dictionary has also been created from WordNet multiword units (MWUs) based on the first and last words of these entries. This was done using two Perl scripts, one to identify heads of multiword units and the other to collect the information together for each head.

    Head analysis was based on the part of speech of an entry, as identified in the WordNet sense index file (index.sense or sense.idx).
  • Nouns with underscores (part of speech "%1"): Two entries were created: (1) an entry for the last word, with the full MWU as a DIMAP instance and (2) an entry for the first word with a DIMAP feature consisting of an attribute name kind and an attribute value consisting of a placeholder "~" followed by the remainder of the MWU. (E.g., abrading_stone leads to the entry stone with an instance abrading stone and to an entry abrading with a feature kind = ~ stone.)
  • Nouns with hyphens (part of speech "%1"): A DIMAP entry was created for the word following the hyphen, with the full noun as a DIMAP instance. (E.g., about-face leads to the entry face with an instance about-face.)
  • Verbs with underscores (part of speech "%2"): A DIMAP entry was created for the first word, with a DIMAP feature consisting of an attribute name kind and an attribute value consisting of a placeholder "~" followed by the remainder of the MWU. (E.g., abide_by leads to an entry abide with a feature kind = ~ by.)
  • Adjectives with underscores or hyphens (part of speech "%3" or "%5"): An entry was created for the last word in the MWU or the part after the hyphen, with the full MWU as a DIMAP instance. (E.g., able-bodied leads to the entry bodied with the instance able-bodied and ad_hoc leads to the entry hoc with an instance ad hoc.)
  • Adverbs with underscores (part of speech "%4"): An entry was created for the last word in the MWU with the full MWU as a DIMAP instance. (E.g., above_all leads to the entry all with an instance above all.)

    The head analysis script produces a file with a set of lines suitable for upload into DIMAP. However, since many entries in this file have the same headword, another Perl script is run to "collect" entries with the same headword. Since the processing above produces two types of information for each entry, DIMAP instances and DIMAP features, this script merely collects the two types for each entry, prior to uploading the full set into the heads DIMAP dictionary.
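
The two steps (head analysis, then collection by headword) can be sketched together. The original work was done with two Perl scripts that are not shown here; the Python below only illustrates the rules listed above. It assumes that underscores are checked before hyphens and that, for words with several hyphens, the part after the last hyphen is the head. The function names and the record layout are hypothetical.

```python
from collections import defaultdict

def analyze_head(lemma, ss_type):
    """Return (headword, kind, value) records for one multiword unit.

    ss_type is the number after '%' in the sense key:
    1 noun, 2 verb, 3 or 5 adjective, 4 adverb.
    """
    display = lemma.replace("_", " ")
    records = []
    if "_" in lemma:
        first, rest = display.split(" ", 1)
        last = display.rsplit(" ", 1)[1]
        if ss_type in (1, 3, 4, 5):      # last word gets the MWU as an instance
            records.append((last, "instance", display))
        if ss_type in (1, 2):            # first word gets a kind feature
            records.append((first, "feature", "kind = ~ " + rest))
    elif "-" in lemma and ss_type in (1, 3, 5):
        records.append((lemma.rsplit("-", 1)[1], "instance", lemma))
    return records

def collect(records):
    """Gather instances and features under each headword."""
    heads = defaultdict(lambda: {"instances": [], "features": []})
    for head, kind, value in records:
        heads[head][kind + "s"].append(value)
    return dict(heads)

records = []
for lemma, ss_type in [("abrading_stone", 1), ("about-face", 1),
                       ("abide_by", 2), ("above_all", 4)]:
    records.extend(analyze_head(lemma, ss_type))
heads = collect(records)
print(heads["stone"], heads["abrading"])
```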

    See Dictionary Entry Sense Display for the DIMAP interface into which WordNet data is placed.


    DIMAP Entry Sense Display

    The screen shot below shows the fields of a DIMAP entry. The relevant fields used in the creation of WordNet entries are: Entry, Code Number, Entry Type, Category, Definition, Usage Note, Usage Label, Superconcepts, Instances, Features (for a sense identifier, adjective type, adjective syntax, and verb properties), and Roles (synsets and all other types of semantic relations). Follow the links to see how WordNet data has been used to populate these fields.




    Instructions for Using a Template in DIMAP

    In DIMAP, select Extract Subdictionary under the Search/Analysis menu item (or press Ctrl-E). In the dialog that appears, check Output matches to file. Under Output Template, select User Specified and push the Template File button to select dmp2xml.tmp. Push the Output file button to name the file where you want the results (usually with an XML extension). Next, make an Extraction Specification, either All words/senses or Match pattern. Finally, push the Extract button. If you selected Match pattern, a dialog will ask you to specify the pattern; follow the instructions in the help file to enter patterns in the various fields (including regular expressions), then push the OK button on this dialog to start the search and generate the output. The progress bar will show you the progress. For the entire WordNet dictionary, this may take a few minutes depending on the speed of your machine; the output for the entire dictionary will also be quite large, over 800 MB. When you have completed the extraction, close the dialog. You will need to open the output in a simple editor (such as WordPad) to wrap the entire file in an opening and closing tag (such as <entries> and </entries>). You may also need to change all simple ampersands (&) to &amp; to be XML compliant.
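
For large extractions, the last two manual steps (wrapping the output in a root tag and escaping bare ampersands) can also be scripted instead of done in an editor. The sketch below is one possible way to do it and is not part of DIMAP. The function names, the UTF-8 encoding, and the file names in the usage note are assumptions.

```python
import re

# An ampersand that does not already start an entity or character reference.
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Change simple ampersands to &amp;, leaving existing references alone."""
    return _BARE_AMP.sub("&amp;", text)

def wrap_file(src, dst, root="entries", encoding="utf-8"):
    """Copy src to dst line by line, escaped and wrapped in <root>...</root>."""
    with open(src, encoding=encoding) as fin, \
         open(dst, "w", encoding=encoding) as fout:
        fout.write(f"<{root}>\n")
        for line in fin:
            fout.write(escape_bare_ampersands(line))
        fout.write(f"</{root}>\n")

print(escape_bare_ampersands("AT&T &amp; R&D &#38;"))
```

A call such as wrap_file("wn_extract.xml", "wn.xml") (both file names hypothetical) processes the file one line at a time, which avoids loading an 800 MB extraction into memory.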


    Feedback

    To report any bugs, request new or enhanced features, obtain product help or documentation, ask a question, make a comment, or request further information from CL Research, send feedback to CL Research.