Knowledge collections and datasets (English)

This page lists knowledge collections and datasets for computational linguistics and natural language processing.

For languages other than English, see List of resources by language.

Clustering by Committee - terms clustered and organized using the Distributional Hypothesis
DIRT Paraphrase Collection - Discovery of Inference Rules from Text
Edinburgh Associative Thesaurus (EAT)
FrameNet
MRC Psycholinguistic Database
Preposition Project
Noun Compound Repository
Reuters-21578 Text Categorization Collection
SAT Analogy Questions - a way of evaluating algorithms for measuring relational similarity
Spam filtering datasets
TEASE - Acquisition of Entailment Relations from the Web
TOEFL Synonym Questions - a way of evaluating algorithms for measuring the degree of similarity between two words
RG-65 Test Collection - suitable for correlation-based evaluation of algorithms for measuring semantic similarity of word pairs
University of South Florida Free Association Norms
VerbOcean - verbs organized by semantic relation, including temporal precedence and strength
WordNet

WordNet Annotated Corpora - a relatively complete list of WordNet-annotated corpora, in English and other languages
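
Several entries above (SAT Analogy Questions, TOEFL Synonym Questions, RG-65) are evaluation sets rather than knowledge sources. As a rough illustration of how such sets are typically scored, the sketch below computes a Spearman rank correlation between human ratings and model scores (the RG-65 style of evaluation) and the accuracy on multiple-choice synonym questions (the TOEFL style). All data, the similarity table, and the function names are invented for illustration; none of it comes from the actual collections, whose file formats are not assumed here.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def multiple_choice_accuracy(questions, similarity):
    """Fraction of questions where the most similar choice is the keyed answer."""
    correct = 0
    for stem, choices, answer in questions:
        guess = max(choices, key=lambda choice: similarity(stem, choice))
        correct += guess == answer
    return correct / len(questions)


# Invented toy data, NOT taken from RG-65 or the TOEFL question set.
human_ratings = [3.9, 3.1, 1.2, 0.4]      # hypothetical human similarity judgements
model_scores = [0.81, 0.66, 0.30, 0.35]   # hypothetical model scores for the same pairs

toy_similarity = {
    ("big", "large"): 0.9, ("big", "small"): 0.4,
    ("happy", "glad"): 0.6, ("happy", "sad"): 0.7,
}


def similarity(a, b):
    """Hypothetical similarity measure: a table lookup with 0.0 as default."""
    return toy_similarity.get((a, b), 0.0)


toy_questions = [
    ("big", ["small", "large", "green", "fast"], "large"),
    ("happy", ["sad", "glad", "tall", "wet"], "glad"),
]

rho = spearman(human_ratings, model_scores)
accuracy = multiple_choice_accuracy(toy_questions, similarity)
print(f"Spearman rho: {rho:.2f}, synonym accuracy: {accuracy:.2f}")
```

In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written correlation; the standard-library version is used here only to keep the sketch self-contained.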

See also NLG:Data sets for a collection of data sets used for building natural language generation systems.

Additional Dataset Collections