Corpora for English

From ACL Wiki

Revision as of 07:52, 10 June 2016

For languages other than English, see List of resources by language.

Free and Downloadable

American National Corpus (ANC)
Congressional floor-debate transcripts, with support/oppose labels
Dialogue Diversity Corpus
English stop words (from SMART)
Groningen Meaning Bank semantically annotated corpus
GUM - Georgetown University Multilayer corpus, annotated with multiple parses, coreference, entities, sentence types and RST
Project Gutenberg
HamleDT, harmonized dependency treebanks of many languages in a common annotation style
Hutter Prize for Lossless Compression of Human Knowledge, 100 MB sample of Wikipedia
Large Text Compression Benchmark, 1 GB sample of Wikipedia
Movie Review Data
Multiword Expression Resources
Susanne: Annotated American English Corpus
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
The LUCY Corpus - Documentation
TRAINS Dialogue Corpus
UMBC Webbase Corpus
UN parallel corpora
VP Ellipsis corpus
WMT corpora, including Europarl, News Commentary, and News Crawl

Proprietary or Require Prior Permission

Araneum Anglicum, Gigaword English web corpus
Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
British National Corpus (BNC)
ClueWeb
Corpus of Spoken Professional English
English Intonation in the British Isles - The IViE Corpus
English Verb Classes And Alternations: A Preliminary Investigation (Index)
GOV2 Corpus - 426 gigabytes of text
Multi-Perspective Question Answering (MPQA)
Oxford English Corpus
Sketch Engine
WaCky
WebCorp

Link collections

Corpora tools

ANNIS - open source search tool for complex multilayer corpora
List of stop words
Poliqarp - open source XML-aware indexer, search engine and concordancer
The Sketch Engine
Treebank tokenization scheme

Retrieved from "https://aclweb.org/aclwiki/index.php?title=Corpora_for_English&oldid=11523"