Corpora for English

From ACL Wiki

Revision as of 09:43, 17 June 2015 by Jonsafari (talk | contribs) (more cleanup)

For languages other than English, see List of resources by language.

Free and Downloadable

American National Corpus (ANC)
Congressional floor-debate transcripts, with support/oppose labels
Dialogue Diversity Corpus
English stop words (from SMART)
Groningen Meaning Bank semantically annotated corpus
Project Gutenberg
HamleDT, harmonized dependency treebanks of many languages in a common annotation style
Hutter Prize for Lossless Compression of Human Knowledge, 100 MB sample of Wikipedia
Large Text Compression Benchmark's 1 GB sample of Wikipedia
Movie Review Data
Multiword Expression Resources
Susanne: Annotated American English Corpus
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
The LUCY Corpus - Documentation
TRAINS Dialogue Corpus
UMBC Webbase Corpus
UN parallel corpora
VP Ellipsis corpus
WMT corpora, including Europarl, News Commentary, and News Crawl

Proprietary or Require Prior Permission

Araneum Anglicum, Gigaword English web corpus
Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
British National Corpus (BNC)
ClueWeb
Corpus of Spoken Professional English
English Intonation in the British Isles - The IViE Corpus
English Verb Classes And Alternations: A Preliminary Investigation (Index)
GOV2 Corpus - 426 gigabytes of text
Multi-Perspective Question Answering (MPQA)
Oxford English Corpus
Sketch Engine
WaCky
WebCorp

Link collections

Corpora tools

List of stop words
Poliqarp - open source XML-aware indexer, search engine and concordancer
The Sketch Engine
Treebank tokenization scheme

Retrieved from "https://aclweb.org/aclwiki/index.php?title=Corpora_for_English&oldid=11078"
