Corpora for English
Latest revision as of 17:58, 2 September 2019
For languages other than English, see List of resources by language.
Free and Downloadable
- American National Corpus (ANC)
- Congressional floor-debate transcripts, with support/oppose labels
- Dialogue Diversity Corpus
- English stop words (from SMART)
- Groningen Meaning Bank semantically annotated corpus
- GUM - Georgetown University Multilayer corpus, annotated with multiple parses, coreference, entities, sentence types and RST
- Project Gutenberg
- International Corpus of English
- HamleDT, harmonized dependency treebanks of many languages in a common annotation style
- Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia
- Large Text Compression Benchmark's 1G sample of Wikipedia
- Movie Review Data
- Multiword Expression Resources
- Susanne: Annotated American English Corpus
- SUSANNE Analytic Scheme
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- UMBC Webbase Corpus
- UN parallel corpora
- VP Ellipsis corpus
- WMT corpora, including Europarl, News Commentary, and News Crawl
Proprietary or Require Prior Permission
- Araneum Anglicum, Gigaword English web corpus
- Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
- British National Corpus (BNC)
- ClueWeb
- Corpus of Spoken Professional English
- English Intonation in the British Isles - The IViE Corpus
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- GOV2 Corpus - 426 gigabytes of text
- Multi-Perspective Question Answering (MPQA)
- Oxford English Corpus
- Sketch Engine
- WaCky
- WebCorp
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- ANNIS - open source search tool for complex multilayer corpora
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme