Corpora for English
Jump to navigation
Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.
For languages other than English, see List of resources by language.
- 1963 Time Magazine corpus
- American English SpeechDat-Car
- American National Corpus (ANC)
- AMERICAN NATIONAL CORPUS FIRST RELEASE
- Biomedical corpora
- BNCweb a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- ClueWeb
- CODA Parallel Annotated Monologue-Dialogue Corpus
- Collins Wordbanks
- Congressional floor-debate transcripts, with support/oppose labels
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles -The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- GOV2 Corpus - 426 gigabytes of text
- Gutenberg
- Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia
- ICAME
- Large Text Compression Benchmark's 1G sample of Wikipedia
- List of English stopwords
- Movie Review Data
- Multi-Perspective Question Answering (MPQA)
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- WaCky
- WebCorp
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Isabella Chiari: Corpora, Software and Linguistic resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme