Difference between revisions of "Corpora, datasets, lexicons"
Revision as of 06:50, 2 November 2006
Miscellaneous
Corpora
English
(alphabetical order)
- American National Corpus (ANC)
- Biomedical corpora
- British National Corpus (BNC)
- Brown Corpus
- Collins Wordbanks
- Gutenberg
- Oxford English Corpus
- WebCorp
Multilingual
(alphabetical order)
- Bank of Swedish
- Oslo Corpus of Bosnian
- Croatian National Corpus (HNK)
- Czech National Corpus (CNC)
- Hungarian National Corpus
- IPI PAN Corpus of Polish
- Portuguese Corpus
- Russian National Corpus (RNK)
- Slovak National Corpus (SNK)
- Slovenian Corpus FIDA and FIDA+
- Spanish Corpus
- Tanaka Corpus: Japanese-English sentence pairs
Other lists of corpora
(alphabetical order)
Datasets
- Edinburgh Associative Thesaurus (EAT)
- Linguistic Data Consortium (LDC)
- MRC Psycholinguistic Database
- Noun Compound Repository
- Reuters-21578 Text Categorization Collection
- University of South Florida Free Association Norms
- WordSimilarity-353 Test Collection
Lexicons
(alphabetical order)
- Catvar 2.0: The Categorial Variation Database - clusters of words related across parts of speech; for example, the developing cluster: {develop (V), developer (N), developed (AJ), developing (N), developing (AJ), development (N)}, where V = verb, N = noun, AJ = adjective
- General Inquirer
- JMdict: Japanese-Multilingual Dictionary file
- LCS Database: Lexical Conceptual Structures
- Moby lexicon project
- ThoughtTreasure
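To make the Catvar example above concrete, here is a minimal Python sketch of how one such cluster might be held in memory and queried by part of speech. The representation (a list of word/tag pairs) and the names `DEVELOPING_CLUSTER` and `variants_by_pos` are assumptions made for illustration only; they do not reflect Catvar's actual file format or any official API.

```python
from collections import defaultdict

# Hypothetical in-memory form of the "developing" cluster listed above.
# This is NOT Catvar's own file format. Tags: V = verb, N = noun, AJ = adjective.
DEVELOPING_CLUSTER = [
    ("develop", "V"),
    ("developer", "N"),
    ("developed", "AJ"),
    ("developing", "N"),
    ("developing", "AJ"),
    ("development", "N"),
]


def variants_by_pos(cluster):
    """Group the words of one cluster by their part-of-speech tag."""
    grouped = defaultdict(list)
    for word, pos in cluster:
        grouped[pos].append(word)
    return dict(grouped)


if __name__ == "__main__":
    for pos, words in variants_by_pos(DEVELOPING_CLUSTER).items():
        print(pos, words)
```

Grouping by tag keeps the two entries for developing (noun and adjective) separate, which is the kind of categorial variation the cluster is meant to record.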
WordNet and enhancements
(alphabetical order)
- eXtended WordNet - WordNet glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
- SentiWordNet - assigns three sentiment scores to each WordNet synset: positivity, negativity, and objectivity
- WordNet - the original
- WordNet Domains - WordNet synsets annotated with domain labels, such as POLITICS, ECONOMY, SPORT
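As an illustration of the three SentiWordNet scores mentioned above, the following Python sketch models the score triple for a single synset. It assumes that each score lies in [0, 1] and that the three scores sum to 1, so objectivity can be derived from the other two; the class and function names are hypothetical, and the numbers in the usage example are invented rather than taken from the resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentimentScores:
    """Score triple for one synset (hypothetical container, not an official API)."""
    positivity: float
    negativity: float
    objectivity: float

    def is_consistent(self, tol=1e-9):
        """Check the assumed constraints: each score in [0, 1], sum equal to 1."""
        scores = (self.positivity, self.negativity, self.objectivity)
        in_range = all(0.0 <= s <= 1.0 for s in scores)
        return in_range and abs(sum(scores) - 1.0) <= tol

    def polarity(self):
        """Signed polarity: positive minus negative score."""
        return self.positivity - self.negativity


def from_pos_neg(positivity, negativity):
    """Derive objectivity as the remainder, under the sum-to-1 assumption."""
    return SentimentScores(positivity, negativity, 1.0 - positivity - negativity)


if __name__ == "__main__":
    # Invented values for illustration only, not real SentiWordNet data.
    scores = from_pos_neg(0.5, 0.25)
    print(scores, scores.is_consistent(), scores.polarity())
```

Storing all three values, rather than only positivity and negativity, makes it possible to validate triples that were read from a file instead of derived.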