Difference between revisions of "Corpora for English"
Line 1: | Line 1: | ||
− | + | ''This list needs some cleaning. Please help.'' | |
==English corpora== | ==English corpora== | ||
+ | |||
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car] | *[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car] | ||
*[http://americannationalcorpus.org/ American National Corpus (ANC)] | *[http://americannationalcorpus.org/ American National Corpus (ANC)] | ||
Line 167: | Line 101: | ||
*[http://korpus.juls.savba.sk/index.en.html Slovak National Corpus] | *[http://korpus.juls.savba.sk/index.en.html Slovak National Corpus] | ||
+ | |||
+ | |||
+ | ==Uncategorized== | ||
+ | |||
+ | *[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus] | ||
+ | *[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus] | ||
+ | *[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts] | ||
+ | *[http://ixa.si.ehu.es/Ixa/resources/sensecorpus A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)] | ||
+ | *[http://odur.let.rug.nl/~vannoord/trees/ Alpino Treebank] | ||
+ | *[http://www.cornelsen.de/international/ An Empirical Grammar of the English Verb System] | ||
+ | *[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL] | ||
+ | *[http://www.aot.ru/search1.html AOT] | ||
+ | *[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1] | ||
+ | *[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais] | ||
+ | *[http://thetis.bl.uk/ BNC Online Service] | ||
+ | *[http://bokrcorpora.narod.ru Bokr Russian Reference Corpus] | ||
+ | *[http://info.ox.ac.uk/bnc/ BRITISH NATIONAL CORPUS - WORLD EDITION] | ||
+ | *[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora] | ||
+ | *[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular] | ||
+ | *[http://www.corpusdelespanol.org/ Corpus del Espanol] | ||
+ | *[http://www.hf.uio.no/easteur-orient/bulg/mat/ Corpus of spoken Bulgarian] | ||
+ | *[http://pioneer.chula.ac.th/~awirote/ling/corpuslst.htm Corpus Resources (Chulalongkorn University, Thailand)] | ||
+ | *[ftp://ftp.cs.cornell.edu/pub/smart/cran/ Cranfield collection] | ||
+ | *[http://corpus.rae.es/creanet.html CREA] | ||
+ | *[http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus] | ||
+ | *[http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus] | ||
+ | *[http://www.eat.rl.ac.uk/ Edinburgh Associative Thesaurus (EAT)] | ||
+ | *[http://www.hum.uva.nl/~ewn EuroWordNet] | ||
+ | *[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)] | ||
+ | *[http://www.csc.fi/kielipankki/ Finnish text bank] | ||
+ | *[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS] | ||
+ | *[http://rali.iro.umontreal.ca/ Hansards Corpus - Searchable] | ||
+ | *[http://www.hcrc.ed.ac.uk/maptask/ HCRC Map Task Corpus XML annotations] | ||
+ | *[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)] | ||
+ | *[http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/ ICOPOST] | ||
+ | *[http://www.ims.uni-stuttgart.de/projekte/TC.html IMS Corpus Toolbox, Univ. of Stuttgart] | ||
+ | *[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ IMS Corpus Workbench (CWB)] | ||
+ | *[http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm International Corpus of Learner English] | ||
+ | *[http://korpus.pl/en/ IPI PAN Polish Corpus] | ||
+ | *[http://www.ipds.uni-kiel.de/links/datenmaterial.en.html Kiel University's Institute on Phonetics and Speech Processing] | ||
+ | *[http://www.nilc.icmc.usp.br/lacioweb Lacio Web Corpora] | ||
+ | *[http://www.vuw.ac.nz/llc/ LANGUAGE LEARNING CENTER - ACADEMIC CORPUS] | ||
+ | *[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs] | ||
+ | *[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words] | ||
+ | *[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources] | ||
+ | *[ftp://ftp.cs.cornell.edu/pub/smart/med/ Medlars collection] | ||
+ | *[ftp://ftp.ox.ac.uk/pub/wordlists/ Miscellaneous Word Lists from Oxford University] | ||
+ | *[http://www.lpl.univ-aix.fr/projects/multext/ Multilingual Text Tools and Corpora] | ||
+ | *[http://www.census.gov/genealogy/names Name lists from US census] | ||
+ | *[http://www.di.fc.ul.pt/~ahb/nexing.htm Nexing Corpus] | ||
+ | *[http://www.cs.cmu.edu/web/books.html On-line books at CMU] | ||
+ | *[http://logos.uio.no/opus/ OPUS -- An Open Source Parallel Corpus] | ||
+ | *[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers] | ||
+ | *[http://elex.amu.edu.pl/~przemka/PICLE_search.php Polish subcorpus of the International Corpus of Learner English] | ||
+ | *[http://www.cirp.es/WXN/wxn/frames/proxectos.html Ramón Piñeiro Center for Research] | ||
+ | *[http://about.reuters.com/researchandstandards/corpus/ Reuters Corpus] | ||
+ | *[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP] | ||
+ | *[http://sanskritlibrary.org/ Sanskrit Library] | ||
+ | *[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus] | ||
+ | *[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio] | ||
+ | *[http://www.ldc.upenn.edu/Catalog/LDC2001S99.html Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio] | ||
+ | *[http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/doc/notes/corpora.txt Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)] | ||
+ | *[http://www.ucl.ac.uk/english-usage/ Survey of English Usage, University College, London] | ||
+ | *[http://www.icsi.berkeley.edu/real/stp/index.html Switchboard Transcription Project] | ||
+ | *[http://www.tractor.de/ TELRI Research Archive of Computational Tools and Resources] | ||
+ | *[http://childes.psy.cmu.edu/ The Childes Corpus - Children's language] | ||
+ | *[http://nora.hd.uib.no/index-e.html The CORPORA DataCenter (Norway)] | ||
+ | *[ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/ The Moby Corpus] | ||
+ | *[http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian Texts] | ||
+ | *[http://www.sketchengine.co.uk/ The Sketch Engine] | ||
+ | *[http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.htm The Sofie Treebank - A Parallel Treebank of North European Languages] | ||
+ | *[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme] |
Revision as of 13:17, 2 November 2006
This list needs some cleaning. Please help.
English corpora
- American English SpeechDat-Car
- American National Corpus (ANC)
- AMERICAN NATIONAL CORPUS FIRST RELEASE
- Biomedical corpora
- BNCweb a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- Collins Wordbanks
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles -The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- Gutenberg
- ICAME
- List of English stopwords
- Mapping WordNet Versions 1.6 and 2.0
- Movie Review Data
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- WebCorp
German corpora
Multilingual corpora
- ACQUIS COMMUNAUTAIRE Multilingual Corpus
- Bank of Swedish
- Croatian National Corpus (HNK)
- Czech National Corpus (CNC)
- CELEX - The Dutch Center for Lexical Information
- Centre for Disease Control - Chinese, French, Japanese, Spanish info on SARS
- COMPARA corpus
- Debian free software community
- EMILLE corpus
- European Parliament Proceedings Parallel Corpus 1996-2003
- EuroWordNet
- French Foreign Ministry's magazine
- GlossaNet
- Haitian Creole corpus -Teknoloji pou lang kreyol
- Hungarian National Corpus
- Hansard French-English parallel corpus
- ICE corpora
- IPI PAN Corpus of Polish
- Learner Behaviour on the Internet
- MuchMore Springer Bilingual Corpus
- MULTEXT-East: Multilingual Corpora for Eastern and Central European Languages
- Multilingual Corpora: Available Resources
- Tanaka Corpus: Japanese-English sentence pairs
- MultiSemCor
- Newspapers on the Internet
- OPUS - an open source parallel corpus
- Oslo Corpus of Bosnian
- PolyU Language Bank
- Portuguese Corpus
- Public registry of the Council of the EU
- Russian National Corpus (RNK)
- The Bible as a Resource for Translation Software
- The ECI Multilingual corpus
- Slovenian Corpus FIDA and FIDA+
- Spanish Corpus
- UN declaration of human rights in multiple languages
- UNITEX
- Useful links about parallel corpora, by Olivier Kraif
- WaCky Project
- Wortlisten: spoken German, English, French, and Dutch
Russian
- Russian Corpora
- Russian Corpus Page
- Russian Corpus Site
- Russian Newspaper Corpus
- Russicon Resources
Slovak
Uncategorized
- 1963 Time Magazine corpus
- 2000 NIST Speaker Recognition Evaluation Corpus
- A Syntactically Annotated Corpus of German Newspaper Texts
- A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)
- Alpino Treebank
- An Empirical Grammar of the English Verb System
- Annotated list of resources on statistical NLP and corpus-based CL
- AOT
- Arabic Newswire Part 1
- Base Textuelle de Moyen Francais
- BNC Online Service
- Bokr Russian Reference Corpus
- BRITISH NATIONAL CORPUS - WORLD EDITION
- Collections of texts and corpora
- Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular
- Corpus del Espanol
- Corpus of spoken Bulgarian
- Corpus Resources (Chulalongkorn University, Thailand)
- Cranfield collection
- CREA
- Czech National Corpus
- Danish news corpus
- Edinburgh Associative Thesaurus (EAT)
- EuroWordNet
- Experimental Corpus Query System (University of Stuttgart, Germany)
- Finnish text bank
- HAITIAN CREOLE ELECTRONIC TEXTS
- Hansards Corpus - Searchable
- HCRC Map Task Corpus XML annotations
- Helsinki Corpus of Swahili (HCS)
- ICOPOST
- IMS Corpus Toolbox, Univ. of Stuttgart
- IMS Corpus Workbench (CWB)
- International Corpus of Learner English
- IPI PAN Polish Corpus
- Kiel University's Institute on Phonetics and Speech Processing
- Lacio Web Corpora
- LANGUAGE LEARNING CENTER - ACADEMIC CORPUS
- list of Japanese transitive - intransitive verb pairs
- List of stop words
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Medlars collection
- Miscellaneous Word Lists from Oxford University
- Multilingual Text Tools and Corpora
- Name lists from US census
- Nexing Corpus
- On-line books at CMU
- OPUS -- An Open Source Parallel Corpus
- Oxford Text Archive Corpus of Italian Newspapers
- Polish subcorpus of the International Corpus of Learner English
- Ramón Piñeiro Center for Research
- Reuters Corpus
- Romanian NLP
- Sanskrit Library
- Slovene-English Parallel Corpus
- Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
- Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio
- Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)
- Survey of English Usage, University College, London
- Switchboard Transcription Project
- TELRI Research Archive of Computational Tools and Resources
- The Childes Corpus - Children's language
- The CORPORA DataCenter (Norway)
- The Moby Corpus
- The Oslo Corpus of Bosnian Texts
- The Sketch Engine
- The Sofie Treebank - A Parallel Treebank of North European Languages
- Treebank tokenization scheme