Difference between revisions of "Corpora for English"
Revision as of 08:43, 17 June 2015
For languages other than English, see List of resources by language.
Free and Downloadable
- American National Corpus (ANC)
- Congressional floor-debate transcripts, with support/oppose labels
- Dialogue Diversity Corpus
- English stop words (from SMART)
- Groningen Meaning Bank semantically annotated corpus
- Project Gutenberg
- HamleDT, harmonized dependency treebanks of many languages in a common annotation style
- Hutter Prize for Lossless Compression of Human Knowledge, 100M sample of Wikipedia
- Large Text Compression Benchmark's 1G sample of Wikipedia
- Movie Review Data
- Multiword Expression Resources
- Susanne: Annotated American English Corpus
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- UMBC Webbase Corpus
- UN parallel corpora
- VP Ellipsis corpus
- WMT corpora, including Europarl, News Commentary, and News Crawl
Proprietary or Require Prior Permission
- Araneum Anglicum, Gigaword English web corpus
- Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
- British National Corpus (BNC)
- ClueWeb
- Corpus of Spoken Professional English
- English Intonation in the British Isles - The IViE Corpus
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- GOV2 Corpus - 426 gigabytes of text
- Multi-Perspective Question Answering (MPQA)
- Oxford English Corpus
- Sketch Engine
- WaCky
- WebCorp
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme