Difference between revisions of "Corpora for English"
Line 8: | Line 8: | ||
<!-- Please keep this list in alphabetical order --> | <!-- Please keep this list in alphabetical order --> | ||
+ | *[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus] | ||
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car] | *[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car] | ||
*[http://americannationalcorpus.org/ American National Corpus (ANC)] | *[http://americannationalcorpus.org/ American National Corpus (ANC)] | ||
Line 78: | Line 79: | ||
*[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme] | *[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme] | ||
− | + | ==Arabic== | |
− | |||
− | |||
− | |||
*[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1] | *[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1] | ||
− | + | ==Bosnian== | |
*[http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian Texts] | *[http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian Texts] | ||
− | + | ==Bulgarian== | |
*[http://www.hf.uio.no/easteur-orient/bulg/mat/ Corpus of spoken Bulgarian] | *[http://www.hf.uio.no/easteur-orient/bulg/mat/ Corpus of spoken Bulgarian] | ||
− | + | ==Croatian== | |
*[http://riznica.ihjj.hr/en/ Croatian Language Corpus at the IHJJ] | *[http://riznica.ihjj.hr/en/ Croatian Language Corpus at the IHJJ] | ||
− | + | ==Czech== | |
*[http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus] | *[http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus] | ||
− | + | ==Danish== | |
*[http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus] | *[http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus] | ||
− | + | ||
− | + | ==Finnish== | |
− | |||
− | |||
− | |||
− | |||
*[http://www.csc.fi/kielipankki/ Finnish text bank] | *[http://www.csc.fi/kielipankki/ Finnish text bank] | ||
− | + | ==French== | |
*[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais] | *[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais] | ||
− | + | ==German== | |
*[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts] | *[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts] | ||
*[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)] | *[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)] | ||
− | + | ==Haitian Creole== | |
*[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS] | *[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS] | ||
− | + | ==Italian== | |
*[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers] | *[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers] | ||
− | + | ==Japanese== | |
*[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs] | *[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs] | ||
− | + | ==Polish== | |
*[http://korpus.pl/en/ IPI PAN Polish Corpus] | *[http://korpus.pl/en/ IPI PAN Polish Corpus] | ||
− | + | ==Romanian== | |
*[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP] | *[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP] | ||
− | + | ==Sanskrit== | |
*[http://sanskritlibrary.org/ Sanskrit Library] | *[http://sanskritlibrary.org/ Sanskrit Library] | ||
− | + | ==Slovenian== | |
*[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus] | *[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus] | ||
− | + | ==Spanish== | |
*[http://www.corpusdelespanol.org/ Corpus del Espanol] | *[http://www.corpusdelespanol.org/ Corpus del Espanol] | ||
*[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular] | *[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular] | ||
− | + | ==Swahili== | |
*[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)] | *[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)] | ||
− | |||
+ | ==Uncategorized== | ||
*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus] | *[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus] |
Revision as of 19:23, 24 April 2008
For languages other than English, see List of resources by language.
English
- 1963 Time Magazine corpus
- American English SpeechDat-Car
- American National Corpus (ANC)
- AMERICAN NATIONAL CORPUS FIRST RELEASE
- Biomedical corpora
- BNCweb a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- Collins Wordbanks
- Congressional floor-debate transcripts, with support/oppose labels
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles - The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- GOV2 Corpus - 426 gigabytes of text
- Gutenberg
- Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia
- ICAME
- Large Text Compression Benchmark's 1G sample of Wikipedia
- List of English stopwords
- Movie Review Data
- Multi-Perspective Question Answering (MPQA)
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- WebCorp
Slovak
Italian
- LIP - Lessico di frequenza dell'Italiano Parlato - Access via BADIP
- ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto
- Corpus di Italiano Scritto contemporaneo (CORIS/CODIS)
- Tesoro della lingua italiana delle origini (TLIO)
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Isabella Chiari: Corpora, Software and Linguistic resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme
Arabic
Bosnian
Bulgarian
Croatian
Czech
Danish
Finnish
French
German
- A Syntactically Annotated Corpus of German Newspaper Texts
- Experimental Corpus Query System (University of Stuttgart, Germany)
Haitian Creole
Italian
Japanese
Polish
Romanian
Sanskrit
Slovenian
Spanish
Swahili
Uncategorized
- 2000 NIST Speaker Recognition Evaluation Corpus
- A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)
- Alpino Treebank
- AOT
- Corpus Resources (Chulalongkorn University, Thailand)
- Cranfield collection
- CREA
- Edinburgh Associative Thesaurus (EAT)
- EuroWordNet
- Hansards Corpus - Searchable
- HCRC Map Task Corpus XML annotations
- ICOPOST
- IMS Corpus Toolbox, Univ. of Stuttgart
- IMS Corpus Workbench (CWB)
- International Corpus of Learner English
- Kiel University's Institute on Phonetics and Speech Processing
- Lacio Web Corpora
- LANGUAGE LEARNING CENTER - ACADEMIC CORPUS
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Medlars collection
- Miscellaneous Word Lists from Oxford University
- Multilingual Text Tools and Corpora
- Name lists from US census
- Nexing Corpus
- On-line books at CMU
- OPUS -- An Open Source Parallel Corpus
- Polish subcorpus of the International Corpus of Learner English
- Ramon Piero Center for Research
- Reuters Corpus
- Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
- Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio
- Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)
- Survey of English Usage, University College, London
- Switchboard Transcription Project
- TELRI Research Archive of Computational Tools and Resources
- The Childes Corpus - Children's language
- The CORPORA DataCenter (Norway)
- The Moby Corpus
- The Sofie Treebank - A Parallel Treebank of North European Languages