Difference between revisions of "Corpora, datasets, lexicons"

From ACL Wiki
(Resources page absorbed this page's contents; this page now redirects to it)
 
== Miscellaneous ==
#redirect [[Resources]]
 
 
* [[Resources]]
 
 
 
== Corpora ==
 
 
 
=== English ===
 
(alphabetical order)
 
* [http://americannationalcorpus.org/ American National Corpus (ANC)]
 
* [http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
 
* [http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
 
* [http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
 
* [http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 
* [http://www.gutenberg.org/wiki/Main_Page Gutenberg]
 
* [http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus]
 
* [http://www.webcorp.org.uk/guide/ WebCorp]
 
 
 
=== Multilingual ===
 
(alphabetical order)
 
* [http://spraakbanken.gu.se/ Bank of Swedish]
 
* [http://www.tekstlab.uio.no/Bosnian/Corpus.html Oslo Corpus of Bosnian]
 
* [http://hnk.ffzg.hr/ Croatian National Corpus (HNK)]
 
* [http://ucnk.ff.cuni.cz/ Czech National Corpus (CNC)]
 
* [http://corpus.nytud.hu/mnsz/ Hungarian National Corpus]
 
* [http://korpus.pl/ IPI PAN Corpus of Polish]
 
* [http://www.corpusdoportugues.org/ Portuguese Corpus]
 
* [http://www.ruscorpora.ru/ Russian National Corpus (RNK)]
 
* [http://korpus.juls.savba.sk/ Slovak National Corpus (SNK)]
 
* [http://www.fida.net/ Slovenian Corpus FIDA] and [http://www.fidaplus.net/ FIDA+]
 
* [http://www.corpusdelespanol.org/ Spanish Corpus]
 
* [http://www.csse.monash.edu.au/~jwb/tanakacorpus.html Tanaka Corpus: Japanese-English sentence pairs]
 
 
 
=== Other lists of corpora ===
 
(alphabetical order)
 
* [http://devoted.to/corpora David Lee's Bookmarks for Corpus-based Linguists]
 
 
 
== Datasets ==
 
 
 
* [http://www.eat.rl.ac.uk/ Edinburgh Associative Thesaurus (EAT)]
 
* [http://www.ldc.upenn.edu/ Linguistic Data Consortium (LDC)]
 
* [http://www.psych.rl.ac.uk/ MRC Psycholinguistic Database]
 
* [http://www.cs.utexas.edu/~mfkb/nn/ Noun Compound Repository]
 
* [http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html Reuters-21578 Text Categorization Collection]
 
* [http://w3.usf.edu/FreeAssociation/ University of South Florida Free Association Norms]
 
* [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html WordSimilarity-353 Test Collection]
 
 
 
== Lexicons ==
 
(alphabetical order)
 
* [http://clipdemos.umiacs.umd.edu/catvar/ Catvar 2.0: The Categorial Variation Database] - for example, the ''developing'' cluster: {''develop'' (V), ''developer'' (N), ''developed'' (AJ), ''developing'' (N), ''developing'' (AJ), ''development'' (N)}
 
* [http://www.wjh.harvard.edu/%7Einquirer/spreadsheet_guide.htm General Inquirer]
 
* [http://www.csse.monash.edu.au/~jwb/edict_doc.html JMdict: Japanese-Multilingual Dictionary file]
 
* [http://www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html LCS Database: Lexical Conceptual Structures]
 
* [http://www.dcs.shef.ac.uk/research/ilash/Moby/ Moby lexicon project]
 
* [http://www.signiform.com/tt/htm/tt.htm ThoughtTreasure]
 
 
 
=== WordNet and enhancements ===
 
(alphabetical order)
 
* [http://xwn.hlt.utdallas.edu/ eXtended WordNet] - glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
 
* [http://patty.isti.cnr.it/~esuli/software/SentiWordNet/ SentiWordNet] - assigns each WordNet synset three sentiment scores: positivity, negativity, and objectivity
 
* [http://wordnet.princeton.edu/ WordNet] - the original
 
* [http://tcc.itc.it/research/textec/topics/disambiguation/wordnetdomains.html WordNet Domains] - WordNet augmented with domain labels such as POLITICS, ECONOMY, and SPORT
 

Latest revision as of 19:03, 15 November 2006
