ACL SIGLEX Resource Links
Special Interest Group on the Lexicon of the Association for Computational Linguistics
Electronic Dictionaries
(8/13/04)
Organization and Definitions of Electronic Dictionaries
The primary purpose of this page is to provide links to lexical data that is
publicly available and can be downloaded for use in other systems. We try to
identify data that is freely available and useful for research purposes, and
note where license agreements are required.
Entries in the section Lists and files with linguistic
information must be more than simple word lists: some piece of
linguistically useful information must be attached to individual words.
At a minimum, this can be a frequency count (based on some corpus), a part of
speech, or a grouping of words according to some linguistic principle.
Preferably, a list will have many types of data associated with each lexical
item, such as pronunciation, definitions, usage notes and labels,
subcategorization patterns, feature names and values, morphological
information, semantic relations with other lexical items, and collocational
information.
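To make the kinds of information listed above concrete, here is a minimal sketch of a record for one lexical item. The class name, field names, and sample values are all hypothetical and are not taken from any of the resources listed on this page; each resource has its own format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    """One word plus optional linguistic information (hypothetical schema)."""
    headword: str
    part_of_speech: Optional[str] = None   # e.g. "verb"
    frequency: Optional[int] = None        # count in some corpus
    pronunciation: Optional[str] = None
    definitions: List[str] = field(default_factory=list)
    subcategorization: List[str] = field(default_factory=list)
    features: Dict[str, str] = field(default_factory=dict)
    related: Dict[str, List[str]] = field(default_factory=dict)  # relation -> headwords
    collocations: List[str] = field(default_factory=list)

    def is_linguistically_annotated(self) -> bool:
        """True if anything beyond the bare word form is attached."""
        return any([
            self.part_of_speech, self.frequency is not None, self.pronunciation,
            self.definitions, self.subcategorization, self.features,
            self.related, self.collocations,
        ])


# Invented sample values, for illustration only.
bare = LexicalEntry("give")
rich = LexicalEntry("give", part_of_speech="verb", frequency=1234,
                    subcategorization=["NP", "NP-NP", "NP-PP"])
```

Under this sketch, `bare` would not qualify for the section (it is just a word), while `rich` would, since it carries a part of speech, a frequency count, and subcategorization patterns.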
The section Ontologies and semantic networks provides
links to publicly available lexical data containing links representing
relations between lexical items.
The sections Links to on-line lookup dictionaries and
thesauruses and Links to links to on-line lookup
dictionaries and thesauruses provide links to the many sources that can be
used to look up individual words, either monolingual or multilingual. These
sources typically do not allow any downloading of multiple words and provide
on-line interfaces for searching their databases.
The section Other major electronic dictionaries
identifies and provides links to commercially available dictionaries, typically
available on CD-ROM.
Lists and Files with Linguistic Information
- Brown Corpus Lexicon of 52,000 words annotated with the 35 Penn Treebank
part-of-speech tags (2.3 tags/word) for use with the Brill tagger, along with
between 135 and 285 lexical and contextual rules "learned" from the Brown
Corpus and the Wall Street Journal Corpus. This lexicon accompanies, and can be
separated from, the supervised transformation-based tagger, available from
Eric Brill's home page.
- The XTAG Project (an
on-going project to develop a wide-coverage grammar for English using a
feature-based and lexicalized Tree Adjoining Grammar formalism) contains an
associated 300K+ word English lexicalized grammar.
- COMLEX
(COMmon LEXicon) is a monolingual English Dictionary consisting of 38,000 head
words, all of which are marked with a rich set of syntactic features and
complements, intended for use in natural language processing. Comlex is
available for both research and commercial use from the
Linguistic Data Consortium.
- Index from
English Verb Classes And Alternations: A Preliminary Investigation, by
Beth Levin
- The Oxford Text Archive (OTA)
has several machine-readable
dictionaries available, including English, German, Gaelic, Latin, and
Serbo-Croat. These are listed and described, and are available via anonymous
ftp at ftp://ota.ox.ac.uk/pub/ota/public/.
- Cambridge
dictionary data is available under research or commercial development licences. Cambridge
dictionaries are designed for learners of English and thus use a restricted
defining vocabulary, many example sentences, and specifically indicate common
collocations and grammatical patterns, all of which make them particularly
appropriate for use in natural language processing and specifically word sense
disambiguation. In addition, the electronic data includes an integrated
semantic network and selectional preference patterns for all words and
meanings.
- 6318 most frequent lemmas
(forms equivalent to headwords in a dictionary) in the British National Corpus
(with frequencies and word class), developed by
Adam Kilgarriff
- Word frequency lists, for widely
available corpora, including the Brown corpus and the LOB corpus.
- The Moby lexicon project is complete and has been placed into the public
domain. Use, sell, rework, excerpt and use in any way on any platform. Placing
this material on internal or public servers is also encouraged. The compiler is
not aware of any export restrictions so freely distribute world-wide. You can
verify the public domain status by contacting Grady Ward, 3449 Martha Ct., Arcata, CA
95521-4884. The project is available from the University of Sheffield
site, either as a complete distribution [26MB] or as a set of subprojects.
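The Brown Corpus Lexicon entry at the top of this list gives an average of 2.3 tags per word. As a rough sketch of how such a figure could be computed, the code below reads a tagged lexicon and averages the number of tags per entry. The one-entry-per-line "word TAG TAG ..." layout is an assumption made for illustration, the function names are hypothetical, and the three sample lines are invented rather than taken from the actual lexicon.

```python
# Invented sample in an assumed "word TAG TAG ..." layout; not real lexicon data.
SAMPLE = """\
the DT
run VB NN VBP
light NN JJ VB
"""


def load_lexicon(text):
    """Map each word to its list of tags, in the order given on the line."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words with no tag
        lexicon[parts[0]] = parts[1:]
    return lexicon


def average_tags_per_word(lexicon):
    """Mean number of tags per lexicon entry (0.0 for an empty lexicon)."""
    if not lexicon:
        return 0.0
    return sum(len(tags) for tags in lexicon.values()) / len(lexicon)


lexicon = load_lexicon(SAMPLE)
average = average_tags_per_word(lexicon)
```

On the invented sample, the three words carry seven tags in total, so `average` is 7/3. Any real lexicon file should be checked against its own documentation before assuming this layout.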
Ontologies and semantic networks
- WordNet®, developed
at the Cognitive Science Laboratory of Princeton University, is an on-line
lexical reference system (also available for download in Unix, PC Windows, and
Macintosh distributions) whose design is inspired by current psycholinguistic
theories of human lexical memory. English nouns, verbs, adjectives and adverbs
are organized into synonym sets, each representing one underlying lexical
concept. Different relations link the synonym sets. (An alphabetic version of
WordNet is available from CL Research.)
Development of WordNet-like databases in other languages has been initiated at
other sites, including:
- EuroWordNet: a multilingual
database with wordnets for several European languages (Dutch, Italian and
Spanish).
- GermaNet: a
lexical-semantic net that is being developed within the LSD Project at the
Division of Computational Linguistics of the Linguistics Department at the
University of Tübingen.
- The Dictionary Parsing Project
is parsing definitions from the 1913 Webster's (see below) to create ontologies
and semantic networks automatically. This technology is freely available at
CL Research and can be used to create
ontologies and semantic networks from dictionary definitions, usually in less
than a week.
- The CIDE+ related words coding scheme, in which the Cambridge International
Dictionary of English is coded into 2,500 "related words" sets. See
http://uk.cambridge.org/elt/cide/related_words.htm
for an example and http://uk.cambridge.org/elt/reference/data.htm
for licensing information.
- The MicroKosmos
project (Knowledge-Based Machine Translation and comprehensive treatment of
lexical, ontological, and text meanings in a "society of
microtheories" architecture), at New Mexico State University, now has an
on-line ontology browser.
- The Upper Cyc Ontology, approximately
3,000 terms capturing the most general concepts of human consensus reality.
- The Unified
Medical Language System Knowledge Sources, Metathesaurus, Semantic Network,
and other data available to those signing an experimental agreement.
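The WordNet entry at the top of this list describes words grouped into synonym sets, with relations linking the sets. The sketch below illustrates that organization with a toy in-memory structure. It does not use the WordNet distribution or any WordNet API; the `Synset` class, the identifiers such as `dog.n`, the relation name `hypernym`, and the three sample sets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A set of synonymous words standing for one concept (toy model)."""
    ident: str
    words: List[str]
    relations: Dict[str, List[str]] = field(default_factory=dict)  # name -> synset ids


# Invented sample data, for illustration only.
SYNSETS = {
    "dog.n": Synset("dog.n", ["dog", "domestic dog"], {"hypernym": ["canine.n"]}),
    "canine.n": Synset("canine.n", ["canine", "canid"], {"hypernym": ["animal.n"]}),
    "animal.n": Synset("animal.n", ["animal", "beast"]),
}


def synsets_for(word, synsets):
    """Return the ids of every synonym set that contains the word."""
    return [s.ident for s in synsets.values() if word in s.words]


def hypernym_chain(ident, synsets):
    """Follow the first 'hypernym' link upward until none is left."""
    chain, seen = [], {ident}
    current = synsets[ident]
    while current.relations.get("hypernym"):
        parent = current.relations["hypernym"][0]
        if parent in seen or parent not in synsets:
            break  # guard against cycles and dangling links
        chain.append(parent)
        seen.add(parent)
        current = synsets[parent]
    return chain
```

With the sample data, `synsets_for("canid", SYNSETS)` finds the set `canine.n`, and `hypernym_chain("dog.n", SYNSETS)` walks from `dog.n` up through `canine.n` to `animal.n`. Keeping relations between sets rather than between individual words is the design point the WordNet description makes.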
Links to on-line lookup dictionaries and thesauruses
- A VERY preliminary implementation of Webster's
Revised Unabridged Dictionary (G. & C. Merriam Co., 1913) with a simple
search engine and hypertext cross-reference scheme, running on the WWW. The
database currently has about 110,000 entries. Free of copyright protection.
Patrick Cassidy, of
MICRA Inc., has
generously made the data available for public access as he did for the 1911
Roget's Thesaurus. Thus, while VERY DATED (which may in itself be of some
historical interest), the 1913 Webster's is at least a little bit better than
nothing for people who do not have access to modern, commercial databases.
- The Wordsmyth English
Dictionary Thesaurus provides a considerable set of information for each
lexical entry, including thesaural links, as well as links to WordNet.
- The free Cambridge Dictionaries Online service at
http://dictionary.cambridge.org.
- EDICTA,
the Early Dictionaries centre, provides: (1) the Early Modern English
Dictionaries Database (11 dictionaries from 1530 to 1657), (2) samples of three
medieval Latin-French dictionaries -- Glossarium Gallico-latinum,
Montpellier-Stockholm Catholicon and Vocabularius familiaris et compendiosus --
and (3) sample databases of Nicot's Thresor and of the 8 complete editions of
the Dictionnaire de l'Académie française.
- Sample Database of the Dictionnaire de
l'Académie française, providing a 1% sample of articles taken from the
eight complete editions. The sample database is accessible on the World Wide
Web during development in order both to make it available to the public and to
invite the criticisms of members of the research community so that the project
teams may better prepare the global database. Developed by
Russon Wooldridge
Links to links to on-line lookup dictionaries and thesauruses
Other major electronic dictionaries
- The second edition of the Oxford English Dictionary is available on CD-ROM
(Windows or Mac) from:
Electronic Publishing Division, Oxford University Press, 200 Madison Avenue,
New York NY 10016; Tel (212) 679-7300, ext. 7370; or Electronic Publishing
Division, Oxford University Press, Walton Street, Oxford OX2 6DP; Tel: +44
(865) 267979; email OUPJSC@VAX.OXFORD.AC.UK. Also, check out their
online site.
- The Cambridge International Dictionary of English CD-ROM at
http://uk.cambridge.org/elt/cide
and the Cambridge Dictionary of American English CD-ROM at
http://www.cup.org/esl/cdae.
- The LDC has reached an agreement
with Merriam Webster, Inc. to distribute
electronic versions of Merriam Webster's Collegiate Dictionary, Tenth Edition
(full text), Biographical Dictionary, New Geographical Dictionary, Collegiate
Thesaurus, School Dictionary, and The Merriam-Webster Dictionary (mass-market
paperback) to the research community. Except for the Collegiate Thesaurus, each
lexicon will be available in two versions, full text and a version with
headwords, functional labels, and pronunciations.
- Le Robert
Electronique is the electronic version of the nine-volume French
dictionary Le Grand Robert de la Langue Française (1985 edition). It is
available on CD-ROM from Chadwyck-Healey Inc., 1101 King Street, Alexandria, VA
22314, USA; Tel: +1 (703) 683-4890 or +1 (800) 752-0515; FAX: +1 (703)
683-7589; or Chadwyck-Healey Ltd., Cambridge Place, Cambridge CB2 1NR, UK; Tel:
+44 (223) 311479; FAX: +44 (223) 66440.