Difference between revisions of "Corpora for English"

Latest revision as of 18:58, 2 September 2019

For languages other than English, see List of resources by language.

Free and Downloadable

American National Corpus (ANC)
Congressional floor-debate transcripts, with support/oppose labels
Dialogue Diversity Corpus
English stop words (from SMART)
Groningen Meaning Bank semantically annotated corpus
GUM - Georgetown University Multilayer corpus, multiple parses, coreference, entities, sentence types and RST
Project Gutenberg
International Corpus of English
HamleDT, harmonized dependency treebanks of many languages, common annotation style.
Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia
Large Text Compression Benchmark's 1G sample of Wikipedia
Movie Review Data
Multiword Expression Resources
Susanne: Annotated American English Corpus
SUSANNE Analytic Scheme
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
The LUCY Corpus - Documentation
TRAINS Dialogue Corpus
UMBC Webbase Corpus
UN parallel corpora
VP Ellipsis corpus
WMT corpora, including Europarl, News Commentary, and News Crawl

Proprietary or Require Prior Permission

Araneum Anglicum, Gigaword English web corpus
Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
British National Corpus (BNC)
ClueWeb
Corpus of Spoken Professional English
English Intonation in the British Isles - The IViE Corpus
English Verb Classes And Alternations: A Preliminary Investigation (Index)
GOV2 Corpus - 426 gigabytes of text
Multi-Perspective Question Answering (MPQA)
Oxford English Corpus
Sketch Engine
WaCky
WebCorp

Link collections

Corpora tools

ANNIS - open source search tool for complex multilayer corpora
List of stop words
Poliqarp - open source XML-aware indexer, search engine and concordancer
The Sketch Engine
Treebank tokenization scheme

@@ Line 8: / Line 8: @@
 *[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c English stop words (from SMART)]
 *[http://gmb.let.rug.nl Groningen Meaning Bank] semantically annotated corpus
+*[https://corpling.uis.georgetown.edu/gum/ GUM - Georgetown University Multilayer corpus], multiple parses, coreference, entities, sentence types and RST
 *[https://www.gutenberg.org Project Gutenberg]
+*[http://www.ucl.ac.uk/english-usage/ice/avail.htm International Corpus of English]
 *[http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
 *[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia]
@@ Line 15: / Line 17: @@
 *[http://mwe.stanford.edu/resources/ Multiword Expression Resources]
 *[http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/susanne/0.html Susanne: Annotated American English Corpus]
+*[http://www.grsampson.net/RSue.html SUSANNE Analytic Scheme]
 *[http://www-users.york.ac.uk/~sp20/corpus.html The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English]
 *[http://www.grsampson.net/LucyDoc.html The LUCY Corpus - Documentation]
@@ Line 66: / Line 69: @@
 <!-- Please keep this list in alphabetical order -->
+*[http://corpus-tools.org/annis/ ANNIS] - open source search tool for complex multilayer corpora
 *[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words]
 *[http://korpus.pl/index.php?page=poliqarp Poliqarp] - open source XML-aware indexer, search engine and concordancer
