Corpora for English
For languages other than English, see List of resources by language.
English
- 1963 Time Magazine corpus
- American English SpeechDat-Car
- American National Corpus (ANC)
- AMERICAN NATIONAL CORPUS FIRST RELEASE
- Biomedical corpora
- BNCweb - a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- Collins Wordbanks
- Congressional floor-debate transcripts, with support/oppose labels
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles - The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- GOV2 Corpus - 426 gigabytes of text
- Gutenberg
- Hutter Prize for Lossless Compression of Human Knowledge - 100M sample of Wikipedia
- ICAME
- Large Text Compression Benchmark's 1G sample of Wikipedia
- List of English stopwords
- Movie Review Data
- Multi-Perspective Question Answering (MPQA)
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- WebCorp
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Isabella Chiari: Corpora, Software and Linguistic resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme
Finnish
French
German
- A Syntactically Annotated Corpus of German Newspaper Texts
- Experimental Corpus Query System (University of Stuttgart, Germany)
Haitian Creole
Italian
Japanese
Polish
Romanian
Sanskrit
Slovenian
Spanish
Swahili
Uncategorized
- 2000 NIST Speaker Recognition Evaluation Corpus
- A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)
- Alpino Treebank
- AOT
- Corpus Resources (Chulalongkorn University, Thailand)
- Cranfield collection
- CREA
- Edinburgh Associative Thesaurus (EAT)
- EuroWordNet
- Hansards Corpus - Searchable
- HCRC Map Task Corpus XML annotations
- ICOPOST
- IMS Corpus Toolbox, Univ. of Stuttgart
- IMS Corpus Workbench (CWB)
- International Corpus of Learner English
- Kiel University's Institute on Phonetics and Speech Processing
- Lacio Web Corpora
- LANGUAGE LEARNING CENTER - ACADEMIC CORPUS
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Medlars collection
- Miscellaneous Word Lists from Oxford University
- Multilingual Text Tools and Corpora
- Name lists from US census
- Nexing Corpus
- On-line books at CMU
- OPUS -- An Open Source Parallel Corpus
- Polish subcorpus of the International Corpus of Learner English
- Ramon Piero Center for Research
- Reuters Corpus
- Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
- Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio
- Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)
- Survey of English Usage, University College, London
- Switchboard Transcription Project
- TELRI Research Archive of Computational Tools and Resources
- The CHILDES Corpus - Children's language
- The CORPORA DataCenter (Norway)
- The Moby Corpus
- The Sofie Treebank - A Parallel Treebank of North European Languages