Corpora for English
Revision as of 12:33, 8 March 2015
For languages other than English, see List of resources by language.
- 1963 Time Magazine corpus
- American English SpeechDat-Car
- American National Corpus (ANC)
- American National Corpus First Release
- Araneum Anglicum, Gigaword English web corpus
- Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
- Biomedical corpora
- BNCweb, a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- ClueWeb
- CODA Parallel Annotated Monologue-Dialogue Corpus
- Collins Wordbanks
- Congressional floor-debate transcripts, with support/oppose labels
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles - The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- GOV2 Corpus - 426 gigabytes of text
- Groningen Meaning Bank semantically annotated corpus
- Gutenberg
- HamleDT, harmonized dependency treebanks of many languages in a common annotation style
- Hutter Prize for Lossless Compression of Human Knowledge, 100 MB sample of Wikipedia
- ICAME
- Large Text Compression Benchmark's 1 GB sample of Wikipedia
- List of English stopwords
- Movie Review Data
- Multi-Perspective Question Answering (MPQA)
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- UMBC Webbase Corpus
- UN parallel corpora
- VP Ellipsis corpus
- WaCky
- WebCorp
- WMT corpora, including Europarl, News Commentary, and News Crawl
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme