ACL SIGLEX Resource Links
Special Interest Group on the Lexicon of the Association for Computational Linguistics
Parallel Corpora
(9/29/96)
Running text in two or more languages.
Parallel corpora available from the Lancaster University Centre for Computer Corpus Research on Language (UCREL):
- The ET10-63 Corpus, a bilingual parallel corpus of English and French, containing official EC documents on telecommunications. The corpus is part-of-speech tagged and lemmatized. Currently the two sub-corpora are held separately, but they will soon be loaded into a database package (INGRES) allowing retrieval of parallel bilingual text (McEnery and Daille, 1993). Approximately 1,250,000 words in each language.
Sample Turkish and English texts, automatically aligned (by Kursat Ince) at the sentence level using Gale and Church's align code. Send any corrections and suggestions to Kemal Oflazer.
PEDANT, the parallel texts in Göteborg. PEDANT consists of texts in several languages and aims to provide a wide range of text types and language pairs, so that researchers can build sub-corpora suited to their specific purposes. Developed by Pernilla Danielsson and Daniel Ridings. Searches, which return something like a parallel concordance, can be run in Swedish, English, French and German.
Regeringsförklaringen is the yearly declaration of the Swedish government, issued in Swedish, English, French, German, and Spanish. The documents have been converted to TEI-conformant SGML, and the sentences in the different language versions have been aligned with Gale and Church's align program. The result is this searchable parallel corpus. Contact Erik Tjong Kim Sang for further details.