Difference between revisions of "Corpora for English"
(HamleDT) |
(42 intermediate revisions by 17 users not shown) | |||
Line 1: | Line 1: | ||
− | + | For languages other than English, see [[List of resources by language]]. | |
− | + | <!-- Please keep this list in alphabetical order --> | |
+ | *[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus] | ||
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car] | *[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car] | ||
*[http://americannationalcorpus.org/ American National Corpus (ANC)] | *[http://americannationalcorpus.org/ American National Corpus (ANC)] | ||
Line 13: | Line 13: | ||
*[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)] | *[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)] | ||
*[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus] | *[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus] | ||
+ | *[http://boston.lti.cs.cmu.edu/Data/clueweb09/ ClueWeb] | ||
+ | *[http://computing.open.ac.uk/coda/data.html CODA Parallel Annotated Monologue-Dialogue Corpus] | ||
*[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks] | *[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks] | ||
+ | *[http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor-debate transcripts, with support/oppose labels] | ||
*[http://www.athel.com/corpdes.html Corpus of Spoken Professional English] | *[http://www.athel.com/corpdes.html Corpus of Spoken Professional English] | ||
*[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus] | *[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus] | ||
Line 21: | Line 24: | ||
*[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)] | *[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)] | ||
*[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus] | *[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus] | ||
+ | *[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text | ||
+ | *[http://gmb.let.rug.nl Groningen Meaning Bank] semantically annotated corpus | ||
*[http://www.gutenberg.org/wiki/Main_Page Gutenberg] | *[http://www.gutenberg.org/wiki/Main_Page Gutenberg] | ||
+ | *[http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style. | ||
+ | *[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia] | ||
*[http://nora.hd.uib.no/icame.html ICAME] | *[http://nora.hd.uib.no/icame.html ICAME] | ||
+ | *[http://www.cs.fit.edu/~mmahoney/compression/text.html Large Text Compression Benchmark's 1G sample of Wikipedia] | ||
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c List of English stopwords] | *[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c List of English stopwords] | ||
*[http://www.cs.cornell.edu/People/pabo/movie-review-data/ Movie Review Data] | *[http://www.cs.cornell.edu/People/pabo/movie-review-data/ Movie Review Data] | ||
+ | *[http://www.cs.pitt.edu/mpqa/ Multi-Perspective Question Answering (MPQA)] | ||
*[http://mwe.stanford.edu/resources/ Multiword Expression Resources] | *[http://mwe.stanford.edu/resources/ Multiword Expression Resources] | ||
*[http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus] | *[http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus] | ||
Line 37: | Line 45: | ||
*[http://www.grsampson.net/LucyDoc.html The LUCY Corpus - Documentation] | *[http://www.grsampson.net/LucyDoc.html The LUCY Corpus - Documentation] | ||
*[http://www.cs.rochester.edu/research/cisd/resources/trains.html TRAINS Dialogue Corpus] | *[http://www.cs.rochester.edu/research/cisd/resources/trains.html TRAINS Dialogue Corpus] | ||
+ | *[http://ebiquity.umbc.edu/resource/html/id/351 UMBC Webbase Corpus] | ||
+ | *[http://www.euromatrixplus.net/multi-un/ UN parallel corpora] | ||
+ | *[http://www.let.rug.nl/~bos/vpe/ VP Ellipsis corpus] | ||
+ | *[http://wacky.sslmit.unibo.it/ WaCky] | ||
*[http://www.webcorp.org.uk/guide/ WebCorp] | *[http://www.webcorp.org.uk/guide/ WebCorp] | ||
− | + | * [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl | |
− | == | + | ==Link collections== |
− | + | <!-- Please keep this list in alphabetical order --> | |
*[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora] | *[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora] | ||
*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources] | *[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources] | ||
*[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL] | *[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL] | ||
− | ==Corpora | + | ==Corpora tools== |
+ | <!-- Please keep this list in alphabetical order --> | ||
*[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words] | *[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words] | ||
+ | *[http://korpus.pl/index.php?page=poliqarp Poliqarp] - open source XML-aware indexer, search engine and concordancer | ||
*[http://www.sketchengine.co.uk/ The Sketch Engine] | *[http://www.sketchengine.co.uk/ The Sketch Engine] | ||
*[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme] | *[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme] | ||
− | + | [[Category:Corpora|*]] | |
Revision as of 09:41, 26 May 2014
For languages other than English, see List of resources by language.
- 1963 Time Magazine corpus
- American English SpeechDat-Car
- American National Corpus (ANC)
- AMERICAN NATIONAL CORPUS FIRST RELEASE
- Biomedical corpora
- BNCweb a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- ClueWeb
- CODA Parallel Annotated Monologue-Dialogue Corpus
- Collins Wordbanks
- Congressional floor-debate transcripts, with support/oppose labels
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles - The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- GOV2 Corpus - 426 gigabytes of text
- Groningen Meaning Bank semantically annotated corpus
- Gutenberg
- HamleDT: harmonized dependency treebanks for many languages in a common annotation style
- Hutter Prize for Lossless Compression of Human Knowledge: 100M sample of Wikipedia
- ICAME
- Large Text Compression Benchmark's 1G sample of Wikipedia
- List of English stopwords
- Movie Review Data
- Multi-Perspective Question Answering (MPQA)
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- UMBC Webbase Corpus
- UN parallel corpora
- VP Ellipsis corpus
- WaCky
- WebCorp
- WMT corpora, including Europarl, News Commentary, and News Crawl
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme
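
Stop-word lists such as the ones linked above are commonly used to drop high-frequency function words from corpus text before counting content words. The following is a minimal sketch of that idea, not the method of any listed tool. The function name `content_word_counts`, the tiny `STOP_WORDS` set, and the simple regular-expression tokenizer are all assumptions made for illustration; a real run would load one of the linked lists from a local file and would likely use a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only.
# In practice, load one of the linked stop-word lists from a local file.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count lowercased word tokens in `text`, skipping stop words.

    Tokenization here is a crude regex (runs of letters and apostrophes),
    which is an assumption and not the Treebank tokenization scheme.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


counts = content_word_counts(
    "The cat and the dog sat in the garden of the house."
)
print(counts.most_common(3))
```

Keeping the stop list as a parameter makes it easy to compare the effect of different lists on the same corpus sample.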