Difference between revisions of "Corpora for English"

Revision as of 15:40, 10 December 2013

For languages other than English, see List of resources by language.

Link collections

Corpora tools

List of stop words
Poliqarp - open source XML-aware indexer, search engine and concordancer
The Sketch Engine
Treebank tokenization scheme

@@ Line 1: / Line 1: @@
 For languages other than English, see [[List of resources by language]].
-<div class="usermessage">
-* Please help us move non-English items below into the [[List of resources by language]].
-</div>
-==English==
 <!-- Please keep this list in alphabetical order -->
@@ Line 19: / Line 13: @@
 *[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)]
 *[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
+*[http://boston.lti.cs.cmu.edu/Data/clueweb09/ ClueWeb]
+*[http://computing.open.ac.uk/coda/data.html CODA Parallel Annotated Monologue-Dialogue Corpus]
 *[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 *[http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor-debate transcripts, with support/oppose labels]
@@ Line 29: / Line 25: @@
 *[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus]
 *[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
+*[http://gmb.let.rug.nl Groningen Meaning Bank] semantically annotated corpus
 *[http://www.gutenberg.org/wiki/Main_Page Gutenberg]
 *[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia]
@@ Line 47: / Line 44: @@
 *[http://www.grsampson.net/LucyDoc.html The LUCY Corpus - Documentation]
 *[http://www.cs.rochester.edu/research/cisd/resources/trains.html TRAINS Dialogue Corpus]
+*[http://ebiquity.umbc.edu/resource/html/id/351 UMBC Webbase Corpus]
+*[http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
+*[http://www.let.rug.nl/~bos/vpe/ VP Ellipsis corpus]
+*[http://wacky.sslmit.unibo.it/ WaCky]
 *[http://www.webcorp.org.uk/guide/ WebCorp]
+* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl
@@ Line 55: / Line 57: @@
 *[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora]
 *[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
-*[http://www.alphabit.net Isabella Chiari: Corpora, Software and Linguistic resources]
 *[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL]
@@ Line 66: / Line 67: @@
 *[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme]
-==Finnish==
-*[http://www.csc.fi/kielipankki/ Finnish text bank]
-==French==
-*[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais]
-==German==
-*[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts]
-*[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)]
-==Haitian Creole==
-*[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS]
-==Italian==
-*[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers]
-==Japanese==
-*[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs]
-==Polish==
-*[http://korpus.pl/en/ IPI PAN Polish Corpus]
-==Romanian==
-*[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP]
-==Sanskrit==
-*[http://sanskritlibrary.org/ Sanskrit Library]
-==Slovenian==
-*[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus]
-==Spanish==
-*[http://www.corpusdelespanol.org/ Corpus del Espanol]
-*[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular]
-==Swahili==
-*[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)]
-==Uncategorized==
-*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus]
-*[http://ixa.si.ehu.es/Ixa/resources/sensecorpus A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)]
-*[http://odur.let.rug.nl/~vannoord/trees/ Alpino Treebank]
-*[http://www.aot.ru/search1.html AOT]
-*[http://pioneer.chula.ac.th/~awirote/ling/corpuslst.htm Corpus Resources (Chulalongkorn University, Thailand)]
-*[ftp://ftp.cs.cornell.edu/pub/smart/cran/ Cranfield collection]
-*[http://corpus.rae.es/creanet.html CREA]
-*[http://www.eat.rl.ac.uk/ Edinburgh Associative Thesaurus (EAT)]
-*[http://www.hum.uva.nl/~ewn EuroWordNet]
-*[http://rali.iro.umontreal.ca/ Hansards Corpus - Searchable]
-*[http://www.hcrc.ed.ac.uk/maptask/ HCRC Map Task Corpus XML annotations]
-*[http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/ ICOPOST]
-*[http://www.ims.uni-stuttgart.de/projekte/TC.html IMS Corpus Toolbox, Univ. of Stuttgart]
-*[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ IMS Corpus Workbench (CWB)]
-*[http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm International Corpus of Learner English]
-*[http://www.ipds.uni-kiel.de/links/datenmaterial.en.html Kiel University's Institute on Phonetics and Speech Procesing]
-*[http://www.nilc.icmc.usp.br/lacioweb Lacio Web Corpora]
-*[http://www.vuw.ac.nz/llc/ LANGUAGE LEARNING CENTER - ACADEMIC CORPUS]
-*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
-*[ftp://ftp.cs.cornell.edu/pub/smart/med/ Medlars collection]
-*[ftp://ftp.ox.ac.uk/pub/wordlists/ Miscellaneous Word Lists from Oxford University]
-*[http://www.lpl.univ-aix.fr/projects/multext/ Multilingual Text Tools and Corpora]
-*[http://www.census.gov/genealogy/names Name lists from US census]
-*[http://www.di.fc.ul.pt/~ahb/nexing.htm Nexing Corpus]
-*[http://www.cs.cmu.edu/web/books.html On-line books at CMU]
-*[http://logos.uio.no/opus/ OPUS -- An Open Source Parallel Corpus]
-*[http://elex.amu.edu.pl/~przemka/PICLE_search.php Polish subcorpus of the International Corpus of Learner English]
-*[http://www.cirp.es/WXN/wxn/frames/proxectos.html Ramon Piero Center for Research]
-*[http://about.reuters.com/researchandstandards/corpus/ Reuters Corpus]
-*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio]
-*[http://www.ldc.upenn.edu/Catalog/LDC2001S99.html Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio]
-*[http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/doc/notes/corpora.txt Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)]
-*[http://www.ucl.ac.uk/english-usage/ Survey of English Usage, University College, London]
-*[http://www.icsi.berkeley.edu/real/stp/index.html Switchboard Transcription Project]
-*[http://www.tractor.de/ TELRI Research Archive of Computational Tools and Resources]
-*[http://childes.psy.cmu.edu/ The Childes Corpus - Children's language]
-*[http://nora.hd.uib.no/index-e.html The CORPORA DataCenter (Norway)]
-*[ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/ The Moby Corpus]
-*[http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.htm The Sofie Treebank - A Parallel Treebank of North European Languages]
 [[Category:Corpora|*]]
