ACL SIGLEX Resource Links
Special Interest Group on the Lexicon of the Association for Computational Linguistics
Phonetic Databases
Lists of words coded phonetically
- The English Pronouncing Dictionary is available for research or commercial
development. See http://uk.cambridge.org/elt/reference/data.htm
for licensing information. This is Cambridge's most popular data product, used
by most leading players in the speech recognition software world. A new (16th)
edition of the book is to be published later this year with, for the first time,
a CD-ROM including sound recordings for all headwords. These sound recordings
(approximately 70,000 in British English and 35,000 in American English) are
also available for licensing.
- Moby Pronunciator: 175,000 entries, fully coded in the International Phonetic
Alphabet (mpron.tar.Z, 3.1MB)
- Carnegie Mellon pronunciation dictionary: phonemic transcriptions of 100,000
words with American English pronunciation, available for download by anonymous
ftp.
Transcriptions of speech samples
- Available from ICAME (International Computer Archive of Modern and Medieval
English) in Bergen, Norway:
- The London-Lund Corpus: Approximately 500,000 words of spoken British
English. The material dates from the 1960s to the mid-1970s. Only the
prosodically annotated version is available.
- Treebanks available from the Lancaster University Centre for Computer
Corpus Research on Language (UCREL):
- The Lancaster/IBM Spoken English Corpus (SEC): Approximately 53,000 words
of British spoken English, mainly taken from radio broadcasts dating from the
mid 1980s. Available as orthographic text, tagged with the CLAWS2
part-of-speech tagging system, parsed, and prosodically annotated. Tapes of a
standard suitable for the instrumental analysis of F0 (fundamental frequency)
values are also available.
- Available from the Linguistic Data Consortium: (LDC-Online is a new search and
retrieval service offering convenient WWW access to the text and speech corpora
of the Linguistic Data Consortium (LDC). For more
detailed information, or to try it out, see the LDC-Online item on the LDC's
home page (http://www.ldc.upenn.edu). All LDC text and speech resources are
indexed for convenient online access, as long as no copyright or other
restrictions prevent it. You can browse (where the copyright owner permits), or
search by word, lemma or part-of-speech, or search with (limited) regular
expressions combining these elements. Statistics such as word frequency and
mutual information between words are also available. Retrieved speech can be
displayed or played via a Java applet (for users with Java-aware browsers). A
Netscape "helper application" is also available for transferring
speech to other programs such as Entropic's waves+. LDC-Online is free to
researchers at current LDC member institutions. An interactive tutorial is
available to members and non-members alike, as is a guest account permitting
access to the Brown text corpus and the TIMIT speech corpus.)
- The Resource Management-Word Data Continuous Speech Database (RM1),
Isolated and Spelled Word Data. This CD-ROM contains previously unreleased
isolated-word and spell-mode (spelled-out words) speech data from the (D)ARPA
Resource Management (RM1) Corpus. The data is based on a 600-word subset of
the 991-word RM1 vocabulary and contains spoken and spelled words pertaining to
the RM1 naval resource management task. It was collected simultaneously with,
and as part of, the RM1 Continuous Speech Corpus (NIST Speech Discs 2-1 to 2-4)
and contains speech from the same sets of subjects used in RM1.
- The Air Traffic Control Corpus (ATC0). This corpus is composed of
approximately 70 hours of voice communication traffic between controllers and
pilots.
- Mandarin Chinese databases: Mandarin Chinese News Text, Mandarin Telephone
Speech, Mandarin Lexicon
- DCIEM SLEEP DEPRIVATION STUDY: MAP TASK DIALOGUES. This set of CD-ROMs
contains the materials used to collect all 216 spoken dialogues, digital audio,
orthographic transcriptions, documentation, and source code for tools. The
dialogues were selected to provide balanced representation at different points
in a sleep deprivation experiment.
- The Trains Dialogue Corpus, which was released by the LDC last April, is now
partially available online. Word transcriptions for each dialogue are available.
The corpus contains six and a half hours of human-human dialogue, comprising
55,000 words and about 5,500 speaker turns. Audio files for the dialogues are
available on the CD-ROM.
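
The LDC-Online entry above mentions statistics such as word frequency and mutual information between words. As a rough illustration of what such statistics involve, the following Python sketch computes word counts and pointwise mutual information (PMI) for adjacent word pairs in a tokenized text. It is not LDC-Online's implementation, which this page does not describe: the function names, the choice of adjacent pairs as the co-occurrence window, and the toy sentence are all assumptions made for the example.

```python
import math
from collections import Counter


def word_frequencies(tokens):
    """Count how often each word occurs in a list of tokens."""
    return Counter(tokens)


def pointwise_mutual_information(tokens):
    """Return PMI (in bits) for every adjacent word pair in `tokens`.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), where the probabilities
    are estimated as relative frequencies of single words and of adjacent
    word pairs. The adjacent-pair window is an assumption of this sketch.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Toy input for illustration only; not drawn from any LDC corpus.
tokens = "the cat sat on the mat the cat ran".split()
freqs = word_frequencies(tokens)
pmi = pointwise_mutual_information(tokens)
```

A positive PMI value means the pair occurs together more often than the individual word frequencies alone would predict. On a real corpus, pairs with very low counts give unreliable estimates and are usually filtered out first.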