Difference between revisions of "Corpora for English"

From ACL Wiki

Revision as of 13:17, 2 November 2006

This list needs some cleaning. Please help.

English corpora

American English SpeechDat-Car
American National Corpus (ANC)
AMERICAN NATIONAL CORPUS FIRST RELEASE
Biomedical corpora
BNCweb a web-based interface to the British National Corpus
Bookmarks for Corpus-based Linguists
British National Corpus (from Oxford University)
British National Corpus (BNC)
British National Corpus project page (from UCREL)
Brown Corpus
Collins Wordbanks
Corpus of Spoken Professional English
Dialogue Diversity Corpus
Electronic Text Center -- University of Virginia
English Intonation in the British Isles - The IViE Corpus
English stop words (from SMART)
English Verb Classes And Alternations: A Preliminary Investigation (Index)
Exploring Words and Phrases from the British National Corpus
Gutenberg
ICAME
List of English stopwords
Mapping WordNet Versions 1.6 and 2.0
Movie Review Data
Multiword Expression Resources
Oxford English Corpus
Phrases in English
Restricted English Corpus from Dr. Caroline Lyon for PhD
Sketch Engine
Susanne: Annotated American English Corpus
The BNC Index (for the BNCWorld Edition)
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
The Dialogue Diversity Corpus
The LUCY Corpus - Documentation
TRAINS Dialogue Corpus
WebCorp

German corpora

Bavarian Archive for Speech Signals Corpora
COSMAS II
NEGRA Corpus

Multilingual corpora

ACQUIS COMMUNAUTAIRE Multilingual Corpus
Bank of Swedish
Croatian National Corpus (HNK)
Czech National Corpus (CNC)
CELEX - The Dutch Center for Lexical Information
Centre for Disease Control - Chinese, French, Japanese, Spanish info on SARS
COMPARA corpus
Debian free software community
EMILLE corpus
European Parliament Proceedings Parallel Corpus 1996-2003
EuroWordNet
French Foreign Ministry's magazine
GlossaNet
Haitian Creole corpus - Teknoloji pou lang kreyol
Hungarian National Corpus
Hansard French-English parallel corpus
ICE corpora
IPI PAN Corpus of Polish
Learner Behaviour on the Internet
MuchMore Springer Bilingual Corpus
MULTEXT-East: Multilingual Corpora for Eastern and Central European Languages
Multilingual Corpora: Available Resources
Tanaka Corpus: Japanese-English sentence pairs
MultiSemCor
Newspapers on the Internet
OPUS - an open source parallel corpus
Oslo Corpus of Bosnian
PolyU Language Bank
Portuguese Corpus
Public registry of the Council of the EU
Russian National Corpus (RNK)
The Bible as a Resource for Translation Software
The ECI Multilingual corpus
Slovenian Corpus FIDA and FIDA+
Spanish Corpus
UN declaration of human rights in multiple languages
UNITEX
Useful links about parallel corpora, by Olivier Kraif
WaCky Project
Wortlisten: spoken German, English, French, and Dutch

Russian

Russian Corpora
Russian Corpora
Russian Corpus Page
Russian Corpus Site
Russian Corpus Site
Russian Newspaper Corpus
Russicon Resources

Slovak

Slovak National Corpus

Uncategorized

1963 Time Magazine corpus
2000 NIST Speaker Recognition Evaluation Corpus
A Syntactically Annotated Corpus of German Newspaper Texts
A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)
Alpino Treebank
An Empirical Grammar of the English Verb System
Annotated list of resources on statistical NLP and corpus-based CL
AOT
Arabic Newswire Part 1
Base Textuelle de Moyen Francais
BNC Online Service
Bokr Russian Reference Corpus
BRITISH NATIONAL CORPUS - WORLD EDITION
Collections of texts and corpora
Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular
Corpus del Espanol
Corpus of spoken Bulgarian
Corpus Resources (Chulalongkorn University, Thailand)
Cranfield collection
CREA
Czech National Corpus
Danish news corpus
Edinburgh Associative Thesaurus (EAT)
EuroWordNet
Experimental Corpus Query System (University of Stuttgart, Germany)
Finnish text bank
HAITIAN CREOLE ELECTRONIC TEXTS
Hansards Corpus - Searchable
HCRC Map Task Corpus XML annotations
Helsinki Corpus of Swahili (HCS)
ICOPOST
IMS Corpus Toolbox, Univ. of Stuttgart
IMS Corpus Workbench (CWB)
International Corpus of Learner English
IPI PAN Polish Corpus
Kiel University's Institute on Phonetics and Speech Processing
Lacio Web Corpora
LANGUAGE LEARNING CENTER - ACADEMIC CORPUS
list of Japanese transitive - intransitive verb pairs
List of stop words
Manuel Barbera: General Corpora and Corpus Linguistics Resources
Medlars collection
Miscellaneous Word Lists from Oxford University
Multilingual Text Tools and Corpora
Name lists from US census
Nexing Corpus
On-line books at CMU
OPUS -- An Open Source Parallel Corpus
Oxford Text Archive Corpus of Italian Newspapers
Polish subcorpus of the International Corpus of Learner English
Ramon Piero Center for Research
Reuters Corpus
Romanian NLP
Sanskrit Library
Slovene-English Parallel Corpus
Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio
Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)
Survey of English Usage, University College, London
Switchboard Transcription Project
TELRI Research Archive of Computational Tools and Resources
The Childes Corpus - Children's language
The CORPORA DataCenter (Norway)
The Moby Corpus
The Oslo Corpus of Bosnian Texts
The Sketch Engine
The Sofie Treebank - A Parallel Treebank of North European Languages
Treebank tokenization scheme

Retrieved from "https://aclweb.org/aclwiki/index.php?title=Corpora_for_English&oldid=2375"

@@ Line 1: / Line 1: @@
-*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
+''This list needs some cleaning. Please help.''
-*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus]
-*[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts]
-*[http://ixa.si.ehu.es/Ixa/resources/sensecorpus A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)]
-*[http://odur.let.rug.nl/~vannoord/trees/ Alpino Treebank]
-*[http://www.cornelsen.de/international/ An Empirical Grammar of the English Verb System]
-*[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL]
-*[http://www.aot.ru/search1.html AOT]
-*[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1]
-*[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais]
-*[http://thetis.bl.uk/ BNC Online Service]
-*[http://bokrcorpora.narod.ru Bokr Russian Reference Corpus]
-*[http://info.ox.ac.uk/bnc/ BRITISH NATIONAL CORPUS - WORLD EDITION]
-*[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora]
-*[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular]
-*[http://www.corpusdelespanol.org/ Corpus del Espanol]
-*[http://www.hf.uio.no/easteur-orient/bulg/mat/ Corpus of spoken Bulgarian]
-*[http://pioneer.chula.ac.th/~awirote/ling/corpuslst.htm Corpus Resources (Chulalongkorn University, Thailand)]
-*[ftp://ftp.cs.cornell.edu/pub/smart/cran/ Cranfield collection]
-*[http://corpus.rae.es/creanet.html CREA]
-*[http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus]
-*[http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus]
-*[http://www.eat.rl.ac.uk/ Edinburgh Associative Thesaurus (EAT)]
-*[http://www.hum.uva.nl/~ewn EuroWordNet]
-*[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)]
-*[http://www.csc.fi/kielipankki/ Finnish text bank]
-*[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS]
-*[http://rali.iro.umontreal.ca/ Hansards Corpus - Searchable]
-*[http://www.hcrc.ed.ac.uk/maptask/ HCRC Map Task Corpus XML annotations]
-*[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)]
-*[http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/ ICOPOST]
-*[http://www.ims.uni-stuttgart.de/projekte/TC.html IMS Corpus Toolbox, Univ. of Stuttgart]
-*[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ IMS Corpus Workbench (CWB)]
-*[http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm International Corpus of Learner English]
-*[http://korpus.pl/en/ IPI PAN Polish Corpus]
-*[http://www.ipds.uni-kiel.de/links/datenmaterial.en.html Kiel University's Institute on Phonetics and Speech Procesing]
-*[http://www.nilc.icmc.usp.br/lacioweb Lacio Web Corpora]
-*[http://www.vuw.ac.nz/llc/ LANGUAGE LEARNING CENTER - ACADEMIC CORPUS]
-*[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs]
-*[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words]
-*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
-*[ftp://ftp.cs.cornell.edu/pub/smart/med/ Medlars collection]
-*[ftp://ftp.ox.ac.uk/pub/wordlists/ Miscellaneous Word Lists from Oxford University]
-*[http://www.lpl.univ-aix.fr/projects/multext/ Multilingual Text Tools and Corpora]
-*[http://www.census.gov/genealogy/names Name lists from US census]
-*[http://www.di.fc.ul.pt/~ahb/nexing.htm Nexing Corpus]
-*[http://www.cs.cmu.edu/web/books.html On-line books at CMU]
-*[http://logos.uio.no/opus/ OPUS -- An Open Source Parallel Corpus]
-*[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers]
-*[http://elex.amu.edu.pl/~przemka/PICLE_search.php Polish subcorpus of the International Corpus of Learner English]
-*[http://www.cirp.es/WXN/wxn/frames/proxectos.html Ramon Piero Center for Research]
-*[http://about.reuters.com/researchandstandards/corpus/ Reuters Corpus]
-*[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP]
-*[http://sanskritlibrary.org/ Sanskrit Library]
-*[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus]
-*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio]
-*[http://www.ldc.upenn.edu/Catalog/LDC2001S99.html Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio]
-*[http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/doc/notes/corpora.txt Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)]
-*[http://www.ucl.ac.uk/english-usage/ Survey of English Usage, University College, London]
-*[http://www.icsi.berkeley.edu/real/stp/index.html Switchboard Transcription Project]
-*[http://www.tractor.de/ TELRI Research Archive of Computational Tools and Resources]
-*[http://childes.psy.cmu.edu/ The Childes Corpus - Children's language]
-*[http://nora.hd.uib.no/index-e.html The CORPORA DataCenter (Norway)]
-*[ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/ The Moby Corpus]
-*[http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian Texts]
-*[http://www.sketchengine.co.uk/ The Sketch Engine]
-*[http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.htm The Sofie Treebank - A Parallel Treebank of North European Languages]
-*[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme]
 ==English corpora==
 *[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car]
 *[http://americannationalcorpus.org/ American National Corpus (ANC)]
@@ Line 167: / Line 101: @@
 *[http://korpus.juls.savba.sk/index.en.html Slovak National Corpus]
+==Uncategorized==
+*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
+*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus]
+*[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts]
+*[http://ixa.si.ehu.es/Ixa/resources/sensecorpus A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)]
+*[http://odur.let.rug.nl/~vannoord/trees/ Alpino Treebank]
+*[http://www.cornelsen.de/international/ An Empirical Grammar of the English Verb System]
+*[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL]
+*[http://www.aot.ru/search1.html AOT]
+*[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1]
+*[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais]
+*[http://thetis.bl.uk/ BNC Online Service]
+*[http://bokrcorpora.narod.ru Bokr Russian Reference Corpus]
+*[http://info.ox.ac.uk/bnc/ BRITISH NATIONAL CORPUS - WORLD EDITION]
+*[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora]
+*[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular]
+*[http://www.corpusdelespanol.org/ Corpus del Espanol]
+*[http://www.hf.uio.no/easteur-orient/bulg/mat/ Corpus of spoken Bulgarian]
+*[http://pioneer.chula.ac.th/~awirote/ling/corpuslst.htm Corpus Resources (Chulalongkorn University, Thailand)]
+*[ftp://ftp.cs.cornell.edu/pub/smart/cran/ Cranfield collection]
+*[http://corpus.rae.es/creanet.html CREA]
+*[http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus]
+*[http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus]
+*[http://www.eat.rl.ac.uk/ Edinburgh Associative Thesaurus (EAT)]
+*[http://www.hum.uva.nl/~ewn EuroWordNet]
+*[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)]
+*[http://www.csc.fi/kielipankki/ Finnish text bank]
+*[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS]
+*[http://rali.iro.umontreal.ca/ Hansards Corpus - Searchable]
+*[http://www.hcrc.ed.ac.uk/maptask/ HCRC Map Task Corpus XML annotations]
+*[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)]
+*[http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/ ICOPOST]
+*[http://www.ims.uni-stuttgart.de/projekte/TC.html IMS Corpus Toolbox, Univ. of Stuttgart]
+*[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ IMS Corpus Workbench (CWB)]
+*[http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm International Corpus of Learner English]
+*[http://korpus.pl/en/ IPI PAN Polish Corpus]
+*[http://www.ipds.uni-kiel.de/links/datenmaterial.en.html Kiel University's Institute on Phonetics and Speech Procesing]
+*[http://www.nilc.icmc.usp.br/lacioweb Lacio Web Corpora]
+*[http://www.vuw.ac.nz/llc/ LANGUAGE LEARNING CENTER - ACADEMIC CORPUS]
+*[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs]
+*[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words]
+*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
+*[ftp://ftp.cs.cornell.edu/pub/smart/med/ Medlars collection]
+*[ftp://ftp.ox.ac.uk/pub/wordlists/ Miscellaneous Word Lists from Oxford University]
+*[http://www.lpl.univ-aix.fr/projects/multext/ Multilingual Text Tools and Corpora]
+*[http://www.census.gov/genealogy/names Name lists from US census]
+*[http://www.di.fc.ul.pt/~ahb/nexing.htm Nexing Corpus]
+*[http://www.cs.cmu.edu/web/books.html On-line books at CMU]
+*[http://logos.uio.no/opus/ OPUS -- An Open Source Parallel Corpus]
+*[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers]
+*[http://elex.amu.edu.pl/~przemka/PICLE_search.php Polish subcorpus of the International Corpus of Learner English]
+*[http://www.cirp.es/WXN/wxn/frames/proxectos.html Ramon Piero Center for Research]
+*[http://about.reuters.com/researchandstandards/corpus/ Reuters Corpus]
+*[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP]
+*[http://sanskritlibrary.org/ Sanskrit Library]
+*[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus]
+*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio]
+*[http://www.ldc.upenn.edu/Catalog/LDC2001S99.html Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio]
+*[http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/doc/notes/corpora.txt Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)]
+*[http://www.ucl.ac.uk/english-usage/ Survey of English Usage, University College, London]
+*[http://www.icsi.berkeley.edu/real/stp/index.html Switchboard Transcription Project]
+*[http://www.tractor.de/ TELRI Research Archive of Computational Tools and Resources]
+*[http://childes.psy.cmu.edu/ The Childes Corpus - Children's language]
+*[http://nora.hd.uib.no/index-e.html The CORPORA DataCenter (Norway)]
+*[ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/ The Moby Corpus]
+*[http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian Texts]
+*[http://www.sketchengine.co.uk/ The Sketch Engine]
+*[http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.htm The Sofie Treebank - A Parallel Treebank of North European Languages]
+*[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme]
