Difference between revisions of "Corpora for English"

From ACL Wiki
(Added: Araneum)
(start work on cleaning up this mess)
Line 2: Line 2:
 
<!-- Please keep this list in alphabetical order -->
 
  
*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
+
===Free and Downloadable===
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car]
 
 
*[http://americannationalcorpus.org/ American National Corpus (ANC)]
 
*[http://americannationalcorpus.org/FirstRelease/ AMERICAN NATIONAL CORPUS FIRST RELEASE]
 
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum], Gigaword English web corpus
 
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum Asiaticum], Gigaword Asian English web corpus
 
*[http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
 
*[http://homepage.mac.com/bncweb/ BNCweb a web-based interface to the British National Corpus]
 
*[http://devoted.to/corpora Bookmarks for Corpus-based Linguists]
 
*[http://info.ox.ac.uk/bnc/ British National Corpus (from Oxford University)]
 
*[http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
 
*[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)]
 
*[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
 
*[http://boston.lti.cs.cmu.edu/Data/clueweb09/ ClueWeb]
 
*[http://computing.open.ac.uk/coda/data.html CODA Parallel Annotated Monologue-Dialogue Corpus]
 
*[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 
 
*[http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor-debate transcripts, with support/oppose labels]
 
*[http://www.athel.com/corpdes.html Corpus of Spoken Professional English]
 
 
*[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus]
 
*[http://etext.lib.virginia.edu/ Electronic Text Center -- University of Virginia]
 
*[http://www.phon.ox.ac.uk/~esther/ivyweb/ English Intonation in the British Isles -The IViE Corpus]
 
 
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c English stop words (from SMART)]
 
*[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)]
 
*[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus]
 
*[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
 
 
*[http://gmb.let.rug.nl Groningen Meaning Bank] semantically annotated corpus
 
*[http://www.gutenberg.org/wiki/Main_Page Gutenberg]
+
*[https://www.gutenberg.org Project Gutenberg]
 
*[http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
 
 
*[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia]
 
*[http://nora.hd.uib.no/icame.html ICAME]
 
 
*[http://www.cs.fit.edu/~mmahoney/compression/text.html Large Text Compression Benchmark's 1G sample of Wikipedia]
 
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c List of English stopwords]
 
 
*[http://www.cs.cornell.edu/People/pabo/movie-review-data/ Movie Review Data]
 
 
*[http://www.cs.pitt.edu/mpqa/ Multi-Perspective Question Answering (MPQA)]
 
Line 54: Line 32:
 
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl
 
  
 +
===Proprietary===
 +
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum], Gigaword English web corpus
 +
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum Asiaticum], Gigaword Asian English web corpus
 +
*[http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
 +
*[http://boston.lti.cs.cmu.edu/Data/clueweb09/ ClueWeb]
 +
*[http://www.athel.com/cpsa.html Corpus of Spoken Professional English]
 +
*[http://www.phon.ox.ac.uk/~esther/ivyweb/ English Intonation in the British Isles -The IViE Corpus]
 +
*[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)]
 +
*[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
 +
 +
 +
 +
<!-- Dead links
 +
*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
 +
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car]
 +
*[http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
 +
*[http://homepage.mac.com/bncweb/ BNCweb a web-based interface to the British National Corpus]
 +
*[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)]
 +
*[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
 +
*[http://computing.open.ac.uk/coda/data.html CODA Parallel Annotated Monologue-Dialogue Corpus]
 +
*[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 +
*[http://etext.lib.virginia.edu/ Electronic Text Center -- University of Virginia]
 +
*[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus]
 +
*[http://nora.hd.uib.no/icame.html ICAME]
 +
 +
-->
  
 
==Link collections==
 

Revision as of 09:21, 17 June 2015

For languages other than English, see List of resources by language.

Free and Downloadable

Proprietary



Link collections

Corpora tools