ACL SIGLEX Resource Links
Special Interest Group on the Lexicon of the
Association for Computational Linguistics
Treebanks: Databanks of text containing part-of-speech tags and labeled constituent structures (e.g., noun
phrase, adverbial phrase, coordinate clause).
Treebanks available from the Lancaster University Centre for Computer Corpus Research on Language (UCREL):
ICAME Treebanks: Treebanks available from ICAME (International Computer Archive of Modern and Medieval English) in Bergen, Norway.
- The Lancaster-Leeds Treebank: A manually parsed subsample of the LOB corpus showing the surface phrase structure of each sentence. Approximately 45,000 words taken from all the genre categories of the LOB corpus.
- The Lancaster Parsed Corpus (LPC): A subsample of the LOB corpus, parsed by computer and manually corrected by several researchers. Approximately 140,000 words with samples from each of the 15 categories in the LOB corpus. (Also available at ICAME.)
- The American Printing House for the Blind Treebank (APHB): A skeleton-parsed corpus of a wide range of English texts. 200,000 words.
- The Associated Press Treebank (AP): A skeleton-parsed corpus of American newswire reports. 1,000,000 words.
- The Canadian Hansard Treebank: A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 words.
- The IBM Manuals Treebank: A skeleton-parsed corpus of computer manuals. 800,000 words.
- The Anaphoric Treebank: A subsample of the AP corpus, annotated to show the reference of pronouns and lexical cohesion. Approximately 100,000 words.
- The Market Research Corpus: A corpus of approximately 1,500,000 words of in-depth market research interview transcripts (from the ACAMRIT project). The data have been tagged for part of speech and word sense, but only about 10% of the corpus has been manually examined.
Two morphologically analyzed and disambiguated Turkish texts (about 12,000 words) are now available online. The morphological parse is presented hierarchically: the inflectional features that follow the last derived form appear at the topmost level, and the nesting levels indicate the derivations in the lexical form. The disambiguation process also preprocesses the morphologically analyzed text to group all lexicalized and non-lexicalized collocations. Send any comments and/or corrections to Kemal Oflazer (http://www.cs.bilkent.edu.tr/~ko/ko.html), Bilkent University Computer Engineering Department, Bilkent, ANKARA, 06533 TURKIYE.
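The hierarchical layout described for the Turkish texts can be pictured as a nested record. The following is only a rough sketch and does not reproduce the actual file format of the distributed texts: the `Analysis` class, its field names, and the feature labels (`NOUN`, `ADJ`, `3SG`, `LOC`) are hypothetical, chosen for illustration. It assumes the example word "evdeki" ("the one in the house"), which is the noun "ev" in the locative case, derived into an adjective by the suffix -ki.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    """One nesting level of a hypothetical hierarchical morphological parse."""
    category: str                                       # part of speech at this level
    features: List[str] = field(default_factory=list)   # inflectional features at this level
    root: Optional[str] = None                          # lexical root, set only at the innermost level
    stem: Optional["Analysis"] = None                   # the form this level was derived from


def derivation_depth(analysis: Analysis) -> int:
    """Count derivations by walking down the nesting levels."""
    depth = 0
    while analysis.stem is not None:
        depth += 1
        analysis = analysis.stem
    return depth


def lexical_root(analysis: Analysis) -> Optional[str]:
    """Return the root stored at the innermost nesting level."""
    while analysis.stem is not None:
        analysis = analysis.stem
    return analysis.root


# Topmost level: the category after the last derivation (adjective) and any
# inflection that follows it (none here). The nested level holds the noun
# "ev" with its own inflectional features (hypothetical labels).
evdeki = Analysis(
    category="ADJ",
    features=[],
    stem=Analysis(category="NOUN", features=["3SG", "LOC"], root="ev"),
)

print(derivation_depth(evdeki), lexical_root(evdeki))
```

In this sketch each derivation adds one nesting level, so the number of derivations in a word is the depth of the record, and the lexical root is always found at the innermost level.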
- LOB Corpus (See description at Corpora.)
- The Lancaster Parsed Corpus (LPC) (See description above.)