Difference between revisions of "Corpora for English"

For languages other than English, see List of resources by language.

Please help us move non-English items below into the List of resources by language.

English

American English SpeechDat-Car
American National Corpus (ANC)
American National Corpus First Release
Biomedical corpora
BNCweb: a web-based interface to the British National Corpus
Bookmarks for Corpus-based Linguists
British National Corpus (from Oxford University)
British National Corpus (BNC)
British National Corpus project page (from UCREL)
Brown Corpus
Collins Wordbanks
Congressional floor-debate transcripts, with support/oppose labels
Corpus of Spoken Professional English
Dialogue Diversity Corpus
Electronic Text Center -- University of Virginia
English Intonation in the British Isles - The IViE Corpus
English stop words (from SMART)
English Verb Classes And Alternations: A Preliminary Investigation (Index)
Exploring Words and Phrases from the British National Corpus
GOV2 Corpus - 426 gigabytes of text
Gutenberg
Hutter Prize for Lossless Compression of Human Knowledge: 100M sample of Wikipedia
ICAME
Large Text Compression Benchmark's 1G sample of Wikipedia
List of English stopwords
Movie Review Data
Multi-Perspective Question Answering (MPQA)
Multiword Expression Resources
Oxford English Corpus
Phrases in English
Restricted English Corpus from Dr. Caroline Lyon for PhD
Sketch Engine
Susanne: Annotated American English Corpus
The BNC Index (for the BNCWorld Edition)
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
The Dialogue Diversity Corpus
The LUCY Corpus - Documentation
TRAINS Dialogue Corpus
WebCorp

Slovak

Slovak National Corpus

Italian

LIP - Lessico di frequenza dell'Italiano Parlato - Access via BADIP
ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto
Corpus di Italiano Scritto contemporaneo (CORIS/CODIS)
Tesoro della lingua italiana delle origini (TLIO)

Link collections

Collections of texts and corpora
Manuel Barbera: General Corpora and Corpus Linguistics Resources
Isabella Chiari: Corpora, Software and Linguistic resources
Annotated list of resources on statistical NLP and corpus-based CL

Corpora tools

List of stop words
Poliqarp - open source XML-aware indexer, search engine and concordancer
The Sketch Engine
Treebank tokenization scheme

Uncategorized

@@ Line 48: / Line 48: @@
 *[http://www.webcorp.org.uk/guide/ WebCorp]
-==Galician==
-<!-- Please keep this list in alphabetical order -->
-*[http://sli.uvigo.es/CLUVI/ Linguistic Corpus of the University of Vigo (CLUVI)]
-*[http://sli.uvigo.es/CTG/ Technical Corpus of Galician (CTG)]
-*[http://www.ti.usc.es/TILG/ Tesouro informatizado da lingua galega (TILG)]
-==German==
-<!-- Please keep this list in alphabetical order -->
-*[http://www.phonetik.uni-muenchen.de/Bas/BasKorporaeng.html Bavarian Archive for Speech Signals Corpora]
-*[http://corpora.ids-mannheim.de/~cosmas/ COSMAS II]
-*[http://www.coli.uni-sb.de/sfb378/negra-corpus/negra-corpus.html NEGRA Corpus]
-==Iranian==
-<!-- Please keep this list in alphabetical order -->
-*[http://ece.ut.ac.ir/DBRG/Bijankhan/ Bijankhan corpus]
-*[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S50 CALLFRIEND Farsi (speech)]
-*[http://ece.ut.ac.ir/dbrg/hamshahri/ Hamshahri corpus]
-*[http://www.elda.org/catalogue/en/speech/S0112.html Persian speech database Farsdat]
-==Russian==
-<!-- Please keep this list in alphabetical order -->
-*[http://bokrcorpora.narod.ru Bokr Russian Reference Corpus]
-*[http://www.slav.helsinki.fi/hanco/index_en.html HANCO: The Helsinki annotated corpus of Russian texts]
-*[http://www.sfb441.uni-tuebingen.de/b1/korpora.html Russian Corpora]
-*[http://rykov-cl.narod.ru/r.html Russian Corpora]
-*[http://lib.ru/ Russian Corpus Site]
-*[http://www.ruscorpora.ru/ The Russian National Corpus]
-*[http://www.philol.msu.ru/~lex/corpus/ Russian Newspaper Corpus]
-*[http://schools.keldysh.ru/uvk1838/Sciper/volume2/langres/russiclr.htm Russicon Resources]
 ==Slovak==

Revision as of 19:19, 24 April 2008

Contents

English

Slovak

Italian

Link collections

Corpora tools

Uncategorized

Arabic

Bosnian

Bulgarian

Croatian

Czech

Danish

English

Finnish

French

German

Haitian Creole

Italian

Japanese

Polish

Romanian

Sanskrit

Slovenian

Spanish

Swahili
