Knowledge collections and datasets (English)
This page lists knowledge collections and datasets for computational linguistics and natural language processing.
For languages other than English, see List of resources by language.
- Clustering by Committee - terms clustered and organized using the Distributional Hypothesis
- DIRT Paraphrase Collection - Discovery of Inference Rules from Text
- Edinburgh Associative Thesaurus (EAT)
- FrameNet
- MRC Psycholinguistic Database
- Preposition Project
- Noun Compound Repository
- Reuters-21578 Text Categorization Collection
- SAT Analogy Questions - a way of evaluating algorithms for measuring relational similarity
- Spam filtering datasets
- TEASE - Acquisition of Entailment Relations from the Web
- TOEFL Synonym Questions - a way of evaluating algorithms for measuring the degree of similarity between two words
- RG-65 Test Collection - suitable for correlation-based evaluation of algorithms for measuring semantic similarity of word pairs
- University of South Florida Free Association Norms
- VerbOcean - verbs organized by semantic relation, including temporal precedence and strength
- WordNet
- WordSimilarity-353 Test Collection
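
Several of the collections above (for example RG-65 and WordSimilarity-353) pair words with human similarity judgments, so an algorithm can be scored by how well its similarity values correlate with the human ones. The following is a minimal sketch of such a correlation-based evaluation using Spearman's rank correlation. It does not reflect an official scoring script for any of these collections. The function names are illustrative, and the scores are invented placeholder values, not taken from any listed dataset.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(human_scores, model_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(human_scores), average_ranks(model_scores))


# Invented placeholder scores for four word pairs (NOT real dataset values).
human = [3.9, 3.1, 0.4, 1.2]      # hypothetical human similarity ratings
model = [0.92, 0.75, 0.10, 0.05]  # hypothetical algorithm outputs

rho = spearman(human, model)
print(f"Spearman correlation: {rho:.2f}")
```

Rank correlation is used here because only the ordering of the word pairs matters, so the human ratings and the algorithm's scores do not need to be on the same scale.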
See also NLG:Data sets for a collection of data sets used for building natural language generation systems.