Difference between revisions of "Corpora, datasets, lexicons"
Line 5: | Line 5: | ||
== Corpora == | == Corpora == | ||
+ | === English === | ||
+ | (alphabetical order) | ||
* [http://americannationalcorpus.org/ American National Corpus (ANC)] | * [http://americannationalcorpus.org/ American National Corpus (ANC)] | ||
* [http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora] | * [http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora] | ||
− | |||
* [http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)] | * [http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)] | ||
* [http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus] | * [http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus] | ||
* [http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks] | * [http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks] | ||
+ | * [http://www.gutenberg.org/wiki/Main_Page Gutenberg] | ||
+ | * [http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus] | ||
+ | * [http://www.webcorp.org.uk/guide/ WebCorp] | ||
+ | |||
+ | === Multilingual === | ||
+ | (alphabetical order) | ||
+ | * [http://spraakbanken.gu.se/ Bank of Swedish] | ||
+ | * [http://www.tekstlab.uio.no/Bosnian/Corpus.html Oslo Corpus of Bosnian] | ||
* [http://hnk.ffzg.hr/ Croatian National Corpus (HNK)] | * [http://hnk.ffzg.hr/ Croatian National Corpus (HNK)] | ||
* [http://ucnk.ff.cuni.cz/ Czech National Corpus (CNC)] | * [http://ucnk.ff.cuni.cz/ Czech National Corpus (CNC)] | ||
− | |||
− | |||
* [http://corpus.nytud.hu/mnsz/ Hungarian National Corpus] | * [http://corpus.nytud.hu/mnsz/ Hungarian National Corpus] | ||
* [http://korpus.pl/ IPI PAN Corpus of Polish] | * [http://korpus.pl/ IPI PAN Corpus of Polish] | ||
− | |||
* [http://www.corpusdoportugues.org/ Portuguese Corpus] | * [http://www.corpusdoportugues.org/ Portuguese Corpus] | ||
* [http://www.ruscorpora.ru/ Russian National Corpus (RNK)] | * [http://www.ruscorpora.ru/ Russian National Corpus (RNK)] | ||
Line 23: | Line 29: | ||
* [http://www.fida.net/ Slovenian Corpus FIDA] and [http://www.fidaplus.net/ FIDA+] | * [http://www.fida.net/ Slovenian Corpus FIDA] and [http://www.fidaplus.net/ FIDA+] | ||
* [http://www.corpusdelespanol.org/ Spanish Corpus] | * [http://www.corpusdelespanol.org/ Spanish Corpus] | ||
− | |||
* [http://www.csse.monash.edu.au/~jwb/tanakacorpus.html Tanaka Corpus: Japanese-English sentence pairs] | * [http://www.csse.monash.edu.au/~jwb/tanakacorpus.html Tanaka Corpus: Japanese-English sentence pairs] | ||
− | * [http:// | + | |
+ | === Other lists of corpora === | ||
+ | (alphabetical order) | ||
+ | * [http://devoted.to/corpora David Lee's Bookmarks for Corpus-based Linguists] | ||
== Datasets == | == Datasets == |
Revision as of 06:45, 2 November 2006
Miscellaneous
Corpora
English
(alphabetical order)
- American National Corpus (ANC)
- Biomedical corpora
- British National Corpus (BNC)
- Brown Corpus
- Collins Wordbanks
- Gutenberg
- Oxford English Corpus
- WebCorp
Multilingual
(alphabetical order)
- Bank of Swedish
- Oslo Corpus of Bosnian
- Croatian National Corpus (HNK)
- Czech National Corpus (CNC)
- Hungarian National Corpus
- IPI PAN Corpus of Polish
- Portuguese Corpus
- Russian National Corpus (RNK)
- Slovak National Corpus (SNK)
- Slovenian Corpus FIDA and FIDA+
- Spanish Corpus
- Tanaka Corpus: Japanese-English sentence pairs
Other lists of corpora
(alphabetical order)
- David Lee's Bookmarks for Corpus-based Linguists
Datasets
- Edinburgh Associative Thesaurus (EAT)
- Linguistic Data Consortium (LDC)
- MRC Psycholinguistic Database
- Noun Compound Repository
- Reuters-21578 Text Categorization Collection
- University of South Florida Free Association Norms
- WordSimilarity-353 Test Collection
Lexicons
- Catvar 2.0: The Categorial Variation Database
- General Inquirer
- JMdict: Japanese-Multilingual Dictionary file
- LCS Database: Lexical Conceptual Structures
- Moby lexicon project
- ThoughtTreasure
- WordNet - the original
- eXtended WordNet - WordNet glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
- WordNet Domains - synsets augmented with domain labels, such as POLITICS, ECONOMY, SPORT
- SentiWordNet - assigns three sentiment scores to each WordNet synset: positivity, negativity, objectivity
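
The three SentiWordNet scores mentioned above can be pictured as a small record per synset. The following is only a minimal sketch: the class name `SynsetSentiment`, the synset identifier, and the score values are hypothetical and invented for illustration, not taken from SentiWordNet itself. It also assumes that the three scores of a synset sum to 1, so that objectivity can be derived from the other two; this page does not state that, so check it against the SentiWordNet distribution you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetSentiment:
    """Hypothetical container for the sentiment scores of one synset."""

    synset_id: str      # illustrative identifier, not a real SentiWordNet key
    positivity: float
    negativity: float

    def __post_init__(self) -> None:
        # Each stored score must lie in [0, 1].
        for score in (self.positivity, self.negativity):
            if not 0.0 <= score <= 1.0:
                raise ValueError("scores must lie between 0 and 1")
        # Assumption: the three scores sum to 1, so these two cannot exceed 1.
        if self.positivity + self.negativity > 1.0:
            raise ValueError("positivity + negativity must not exceed 1")

    @property
    def objectivity(self) -> float:
        # Derived under the sum-to-one assumption stated above.
        return 1.0 - (self.positivity + self.negativity)


# Made-up values, chosen only to show the arithmetic.
example = SynsetSentiment("example.a.01", positivity=0.625, negativity=0.125)
print(example.objectivity)  # 0.25
```

Storing only two scores and deriving the third keeps the record consistent by construction; if your copy of the resource lists all three explicitly, reading them directly is just as reasonable.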