Difference between revisions of "Knowledge collections and datasets (English)"

Latest revision as of 19:19, 17 November 2013

Knowledge collections and datasets for Computational Linguistics and Natural Language Processing.

For languages other than English, see List of resources by language.

Clustering by Committee - terms clustered and organized using the Distributional Hypothesis
DIRT Paraphrase Collection - Discovery of Inference Rules from Text
Edinburgh Associative Thesaurus (EAT)
FrameNet
MRC Psycholinguistic Database
Preposition Project
Noun Compound Repository
Reuters-21578 Text Categorization Collection
SAT Analogy Questions - a way of evaluating algorithms for measuring relational similarity
Spam filtering datasets
TEASE - Acquisition of Entailment Relations from the Web
TOEFL Synonym Questions - a way of evaluating algorithms for measuring the degree of similarity between two words
RG-65 Test Collection - suitable for correlation-based evaluation of algorithms for measuring semantic similarity of word pairs
University of South Florida Free Association Norms
VerbOcean - verbs organized by semantic relation, including temporal precedence and strength
WordNet

Wordnet Annotated Corpora - a relatively complete list of wordnet-annotated corpora, in both English and other languages

WordSimilarity-353 Test Collection

See also NLG:Data sets for a collection of data sets used for building natural language generation systems.
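
The RG-65 entry above mentions correlation-based evaluation. As a rough illustration only, the sketch below computes a Spearman rank correlation between human similarity ratings and an algorithm's scores for the same word pairs. The function names and the four score values are made up for this example; they are not taken from RG-65 or from any other collection listed here, and a real evaluation would load the ratings from the dataset itself.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for four word pairs (illustrative values only).
human_ratings = [3.9, 3.1, 0.4, 1.2]
system_scores = [0.8, 0.9, 0.1, 0.3]

print(round(spearman(human_ratings, system_scores), 3))
```

A rank-based measure is used here because it only asks whether the algorithm orders the word pairs the way the human raters do, regardless of the scale of its scores. Whether a given benchmark is conventionally reported with Spearman or Pearson correlation should be checked against that benchmark's own page.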

Additional Dataset Collections

Linguistic Data Consortium (LDC)

@@ Line 1: / Line 1: @@
-Datasets for Computational Linguistics and Natural Language Processing.
+Knowledge collections and datasets for Computational Linguistics and Natural Language Processing.
+For languages other than English, see [[List of resources by language]].
+<!-- Please keep this list in alphabetical order -->
 * [[Clustering by Committee]] - terms clustered and organized using the [[Distributional Hypothesis]]
 * [[DIRT Paraphrase Collection]] - Discovery of Inference Rules from Text
@@ Line 6: / Line 9: @@
 * [http://framenet.icsi.berkeley.edu/ FrameNet]
 * [http://www.psych.rl.ac.uk/ MRC Psycholinguistic Database]
+* [http://www.clres.com/prepositions.html Preposition Project]
 * [[Noun compound repository|Noun Compound Repository]]
 * [http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html Reuters-21578 Text Categorization Collection]
+* [[SAT Analogy Questions]] - a way of evaluating algorithms for measuring relational similarity
 * [[Spam filtering datasets]]
 * [[TEASE]] - Acquisition of Entailment Relations from the Web
+* [[TOEFL Synonym Questions]] - a way of evaluating algorithms for measuring degree of similarity between 2 words
+* [[RG-65 Test Collection (State of the art)|RG-65 Test Collection]] - suitable for correlation-based evaluation of algorithms for measuring semantic similarity of word pairs
 * [http://w3.usf.edu/FreeAssociation/ University of South Florida Free Association Norms]
 * [[VerbOcean]] - verbs organized by semantic relation, including temporal precedence and strength
 * [[WordNet]]
+:* [http://globalwordnet.org/?page_id=241 Wordnet Annotated Corpora] A relatively complete list of wordnet annotated corpora, both in English and other languages
 * [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html WordSimilarity-353 Test Collection]
+See also [[NLG:Data sets]] for a collection of data sets used for building natural language generation systems.
 == Additional Dataset Collections ==
