ACL SIGLEX Resource Links
Special Interest Group on the Lexicon of the Association for Computational Linguistics
Corpora
Corpora are intended to be representative of some specified population or
genre. Corpora are needed for large scale, systematic contrasts of, for
example, language varieties, genres, and modalities (e.g., American vs. British
English, informative vs. imaginative prose, or spoken vs. written language).
Other research requires enormous amounts of data, even if from fewer genres, as
for example, in lexicography, in order to detect words and collocations which
occur only rarely. These are recognized corpora used in research; this list
does not attempt to include all text collections. See also parallel corpora
(aligned texts in two or more languages).
Languages
English
- ICAME Corpora: Corpora
available from ICAME
(International Computer Archive of Modern and Medieval English) in Bergen,
Norway.
- The Lancaster/Oslo-Bergen Corpus (LOB): Approximately 1,000,000 words of
British written English dating from 1960. The corpus is made up of 15 different
genre categories. Available as orthographic text and tagged with the CLAWS1
part-of-speech tagging system. The Leeds-Lancaster Treebank and Lancaster
Parsed Corpus are analyzed subsamples of the LOB corpus. (See also SIGLEX
Treebanks.)
- The Brown University Corpus: Approximately 1,000,000 words of American
written English dating from 1960. The genre categories are parallel to those of
the LOB corpus.
- The Helsinki Corpus (Diachronic Part): samples from texts covering the Old,
Middle, and Early Modern English periods. 1,500,000 words in total.
- Melbourne-Surrey Corpus: 100,000 words of Australian newspaper texts.
- The Kolhapur Corpus: Approximately 1,000,000 words of Indian written
English dating from 1978. Again, the genre categories are parallel to those of
the LOB corpus.
- The Helsinki Corpus of Older Scots: 830,000 words from 1450 - 1700 covering
15 prose genres.
- The British National Corpus: extracts from 4124 modern British English texts
of all kinds, both spoken and written, available for academic research to those
located in a member state of the EU. Each text is segmented into orthographic
sentence units, and each word is automatically assigned a part-of-speech code.
There are about 6.25 million sentences and over 100 million words.
- The Cambridge International Corpus at
http://uk.cambridge.org/elt/corpus.
- The International
Corpus of English (ICE) is collecting corpora in 20 countries and regions
(Australia to Zambia). Each participating country is collecting, computerizing,
and analyzing a corpus of one million words of their own national or regional
variety of English, spoken or written between 1990 and 1996. Each team is
following a common corpus design, as well as a common scheme for grammatical
annotation. ICE incorporates ICLE, the International Corpus of Learner English.
- The Wellington Corpus of Written New Zealand English (parallel to LOB), the
New Zealand ICE Corpus, the Wellington Corpus of Spoken New Zealand English
(1 million words), and other smaller collections of data cover a number of
styles and speakers of different ethnicities, genders, ages, etc., allowing
lexical, grammatical, and phonetic analysis. For academic queries, please
contact Laurie Bauer or Paul Warren.
- The Longman-Lancaster Corpus: Approximately 14.5 million words of written
English from various geographical locations in the English-speaking world and
of various dates and text types.
- Moby Shakespeare: The complete unabridged works of Shakespeare (mshak.tar.Z, 2.3MB).
Dutch
Portuguese
- The Corpus do Português allows you to quickly and easily search more than 45 million words in more than 50,000 Portuguese texts from the 1300s to the 1900s. The interface allows you to search for exact words or phrases, substrings, lemmas, part of speech, or any combinations of these. You can also search for surrounding words (collocates) within a ten-word window.
The corpus also allows you to easily compare (and see, via charts) the frequency and distribution of words, phrases, and grammatical constructions across texts, in at least three ways:
- By register: comparisons between spoken, fiction, newspaper, and academic
- By dialect: Portugal compared with Brazil
- By historical period: compare different centuries from the 1300s to the 1900s
You can also easily carry out semantically-based queries of the corpus. For example, you can compare and contrast the collocates of two related words to determine the difference in meaning between these words. You can find the frequency and distribution of synonyms for more than 20,000 words and also compare their frequency in different registers, countries, and historical periods, and use these word lists as part of other queries. Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.
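As a rough illustration of the collocate searches described above, the following Python sketch counts the words that occur near a node word and contrasts the collocates of two related words. It is not the Corpus do Português interface or its implementation; the function names, the toy sentence, and the reading of "ten-word window" as up to ten tokens on each side of the node word are all assumptions made for the example.

```python
from collections import Counter


def collocates(tokens, node, window=10):
    """Count tokens found within `window` positions on either side of `node`.

    Assumption: the window is symmetric (up to `window` tokens to the left
    and to the right of each occurrence of the node word).
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


def contrast(tokens, word_a, word_b, window=10):
    """Return the collocates seen only with word_a and only with word_b."""
    a = collocates(tokens, word_a, window)
    b = collocates(tokens, word_b, window)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return only_a, only_b


# Hypothetical toy corpus, already tokenized by whitespace.
tokens = "o gato preto dorme e o rato preto corre".split()

print(collocates(tokens, "preto", window=1))
print(contrast(tokens, "gato", "rato", window=2))
```

On a real corpus the token list would come from a proper tokenizer, and raw counts would usually be replaced by an association measure, since frequent function words otherwise dominate the results.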