Difference between revisions of "Corpora for English"

From ACL Wiki
Line 24:
 
*[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
 
 
*[http://www.gutenberg.org/wiki/Main_Page Gutenberg]
 
+ *[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia]
 
*[http://nora.hd.uib.no/icame.html ICAME]
 
+ *[http://www.cs.fit.edu/~mmahoney/compression/text.html Large Text Compression Benchmark's 1G sample of Wikipedia]
 
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c List of English stopwords]
 
 
*[http://www.lsi.upc.es/~nlp/tools/mapping.html Mapping WordNet Versions 1.6 and 2.0]
 

Revision as of 10:59, 7 November 2006

This list needs some cleaning. Please help.

English

German

Multilingual

Russian

Slovak

Italian

Link collections

Corpora tools

Uncategorized