Difference between revisions of "Corpora, datasets, lexicons"
Line 39:
  * [http://clipdemos.umiacs.umd.edu/catvar/ Catvar 2.0: The Categorial Variation Database]
−
  * [http://www.wjh.harvard.edu/%7Einquirer/spreadsheet_guide.htm General Inquirer]
  * [http://www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html LCS Database: Lexical Conceptual Structures]
  * [http://www.dcs.shef.ac.uk/research/ilash/Moby/ Moby lexicon project]
−
  * [http://www.signiform.com/tt/htm/tt.htm ThoughtTreasure]
− * [http://wordnet.princeton.edu/ WordNet]
− * [http://tcc.itc.it/research/textec/topics/disambiguation/wordnetdomains.html WordNet Domains]
+
+ ===WordNet and enhancements===
+
+ * [http://wordnet.princeton.edu/ WordNet] - the original
+ * [http://xwn.hlt.utdallas.edu/ eXtended WordNet] - glosses are syntactically parsed, transformed into logic forms, and content words are semantically disambiguated
+ * [http://tcc.itc.it/research/textec/topics/disambiguation/wordnetdomains.html WordNet Domains] - augmented with Domain Labels, such as POLITICS, ECONOMY, SPORT
+ * [http://patty.isti.cnr.it/~esuli/software/SentiWordNet/ SentiWordNet] - assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity
Revision as of 09:39, 1 November 2006
Miscellaneous
Corpora
- American National Corpus (ANC)
- Biomedical corpora
- The Oslo Corpus of Bosnian
- British National Corpus (BNC)
- Brown Corpus
- Collins Wordbanks
- Croatian National Corpus (HNK)
- Czech National Corpus (CNC)
- David Lee's Bookmarks for Corpus-based Linguists
- Gutenberg
- Hungarian National Corpus
- IPI PAN Corpus of Polish
- Oxford English Corpus
- Portuguese Corpus
- Russian National Corpus (RNK)
- Slovak National Corpus (SNK)
- Slovenian Corpus FIDA and FIDA+
- Spanish Corpus
- Bank of Swedish
- WebCorp
Datasets
- Edinburgh Associative Thesaurus (EAT)
- Linguistic Data Consortium (LDC)
- MRC Psycholinguistic Database
- Noun Compound Repository
- Reuters-21578 Text Categorization Collection
- University of South Florida Free Association Norms
- WordSimilarity-353 Test Collection
Lexicons
- Catvar 2.0: The Categorial Variation Database
- General Inquirer
- LCS Database: Lexical Conceptual Structures
- Moby lexicon project
- ThoughtTreasure
WordNet and enhancements
- WordNet - the original
- eXtended WordNet - WordNet glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
- WordNet Domains - WordNet synsets augmented with domain labels such as POLITICS, ECONOMY, and SPORT
- SentiWordNet - assigns each WordNet synset three sentiment scores: positivity, negativity, and objectivity
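
As a rough illustration of the three-score idea behind SentiWordNet, the sketch below models one synset's scores in Python. It is not SentiWordNet's file format or API: the class name `SynsetSentiment`, the field names, and the synset identifier are hypothetical. It also assumes something this page does not state, namely that the three scores each lie in [0, 1] and sum to 1, so objectivity can be derived from the other two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetSentiment:
    """Hypothetical record holding the sentiment scores of one synset."""

    synset_id: str      # illustrative identifier, not a real SentiWordNet key
    positivity: float
    negativity: float

    def __post_init__(self) -> None:
        # Assumption: each score lies in [0, 1].
        for score in (self.positivity, self.negativity):
            if not 0.0 <= score <= 1.0:
                raise ValueError("scores must lie in [0, 1]")
        # Assumption: the three scores sum to 1, so these two cannot exceed 1.
        if self.positivity + self.negativity > 1.0:
            raise ValueError("positivity + negativity must not exceed 1")

    @property
    def objectivity(self) -> float:
        # Under the sum-to-one assumption, objectivity is the remainder.
        return 1.0 - self.positivity - self.negativity


example = SynsetSentiment("example.a.01", positivity=0.625, negativity=0.125)
print(example.objectivity)
```

Storing only two scores and deriving the third keeps a record from drifting out of sync; if the actual data does not follow the sum-to-one assumption, all three values would need to be stored explicitly.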