Word sense disambiguation resources

From ACL Wiki
Revision as of 05:42, 12 December 2014 by Tristan Miller (talk | contribs) (migrated from https://www.ukp.tu-darmstadt.de/research/scientific-community/ukpedia/word-sense-disambiguation/)

Word sense disambiguation (WSD) is an open problem in natural language processing concerned with determining which sense (i.e., meaning) of a word is used in a particular context. This article provides links to important WSD-related publications, software, corpora, and other resources.
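
To make the task concrete, the following is a minimal sketch of a gloss-overlap disambiguator in the spirit of the simplified Lesk approach. The sense labels, glosses, and stopword list are hypothetical and written only for this illustration; they do not come from any of the resources listed in this article.

```python
# Toy sense inventory: hypothetical labels and glosses, for illustration only.
TOY_INVENTORY = {
    "bass": {
        "bass/fish": "a freshwater or sea fish caught for food or sport",
        "bass/music": "the lowest part or voice in music or a low pitched instrument",
    },
}

# Small ad hoc stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "or", "in", "for", "on", "and", "to",
             "at", "is", "was", "he", "she", "with", "while"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(word, context, inventory=TOY_INVENTORY):
    """Return the sense whose gloss shares the most words with the context.

    Ties are resolved in favour of the sense listed first in the inventory.
    """
    context_words = tokenize(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in inventory[word].items():
        overlap = len(context_words & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


if __name__ == "__main__":
    print(simplified_lesk("bass", "He caught a large bass while fishing for food"))
    print(simplified_lesk("bass", "She sings the lowest voice part and plays bass"))
```

Real systems replace the toy inventory with a resource such as WordNet and usually combine overlap with sense frequency or learned features; this sketch only shows the input and output of the task.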

Introductory material, overviews, and surveys

  • Word sense disambiguation (Wikipedia)
  • Word sense disambiguation (Scholarpedia)
  • Word sense disambiguation (ACLWiki)
  • Eneko Agirre and Philip Edmonds, editors. Word Sense Disambiguation: Algorithms and Applications, volume 33 of Text, Speech, and Language Technology. Springer, 2006. ISBN 978-1-4020-6870-6.
  • Advances in Word Sense Disambiguation tutorial by Rada Mihalcea and Ted Pedersen (2005)
  • Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41:10:1–10:69, February 2009. ISSN 0360-0300.
  • Nancy Ide and Jean Véronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40, 1998. ISSN 0891-2017.
  • K. C. Litkowski. Computational lexicons and dictionaries. In Keith Brown, editor, Encyclopedia of Language and Linguistics, pages 753–761. Elsevier Science, Oxford, second edition, 2005. ISBN 978-0-08-044299-0.
  • Philip Edmonds. Lexical disambiguation. In Keith Brown, editor, Encyclopedia of Language and Linguistics, pages 607–623. Elsevier Science, Oxford, second edition, 2005. ISBN 978-0-08-044299-0.
  • David Jurafsky and James H. Martin. Speech and Language Processing, chapter Computational Lexical Semantics. Prentice Hall, second edition, 2008. ISBN 978-0131873216.
  • Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing, chapter Word Sense Disambiguation, pages 229–264. The MIT Press, 1999. ISBN 978-0262133609.
  • David Yarowsky. Word sense disambiguation. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing, pages 315–338. Chapman and Hall/CRC, second edition, 2010. ISBN 978-1420085921.

Conferences, workshops, and journals

Sense inventories and other lexical resources

DANTE
A lexical database for English
GCIDE_XML
The GNU version of the Collaborative International Dictionary of English (CIDE), presented in XML
HECTOR
A 35-word English dictionary used for Senseval-1
Longman Dictionary of Contemporary English (LDOCE). Burnt Mill, Essex: Longman, 1978
This proprietary dictionary saw considerable use by the WSD research community before less restrictively licensed resources became available.
Roget's International Thesaurus. New York: Harper Collins, 1992
This proprietary thesaurus saw considerable use by the WSD research community before less restrictively licensed resources became available.
The Open Roget's Project
A free implementation of the 1911 Roget's Thesaurus.

Wordnets and associated resources

WordNet
A lexical database for English
Wordnets in the world
A list of wordnets for various languages
eXtended WordNet
A version of WordNet in which the glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
Inter-version WordNet mappings
Mappings between synset offsets in various WordNet versions
MCR
An integration of five local wordnets, the EuroWordNet Top Concept ontology, MultiWordNet Domains, and hundreds of thousands of new semantic relations and properties automatically acquired from corpora.
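
Because sense-annotated corpora are tied to a particular WordNet version, inter-version mappings are typically applied as a lookup from a synset offset in one version to one or more candidate offsets in another. The sketch below assumes a plain-text layout with one source offset per line followed by pairs of target offset and confidence; this layout, the function names, and the offsets are assumptions made for illustration, so check the documentation of the mapping files actually used.

```python
# Hypothetical mapping lines: a source offset, then (target offset, confidence)
# pairs. The offsets are made-up placeholders, not real WordNet synsets.
SAMPLE_MAPPING = """\
00000001 00000011 1
00000002 00000021 0.75 00000022 0.25
"""


def parse_mapping(text):
    """Parse mapping lines into {source_offset: [(target_offset, confidence), ...]}.

    Offsets are kept as strings so that leading zeros are preserved.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source, rest = fields[0], fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError(f"malformed mapping line: {line!r}")
        mapping[source] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return mapping


def best_target(mapping, source_offset):
    """Return the highest-confidence target offset, or None if unmapped."""
    candidates = mapping.get(source_offset)
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    mapping = parse_mapping(SAMPLE_MAPPING)
    print(best_target(mapping, "00000002"))
```

Keeping every candidate with its confidence, rather than only the best one, lets downstream code decide how to treat synsets that were split or merged between versions.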

Annotated corpora

Alan Smeaton and Ian Quigley's image captions
8816 WordNet 1.5-annotated instances of 2304 lemmas in 2714 image captions
DSO Corpus of Sense-Tagged English
Sense-tagged word occurrences for 121 nouns and 70 verbs occurring in the Brown Corpus and Wall Street Journal corpus
HECTOR (Senseval-1)
Separate training and test corpora with 35 word types annotated with their HECTOR senses. See also Ted Pedersen's conversions.
interest
Wall Street Journal articles with 2369 instances of "interest" annotated with their LDOCE senses. See Ted Pedersen's conversions.
line, hard, serve
Wall Street Journal articles with over 12,000 instances of "line", "hard", and "serve" tagged with a subset of their WordNet 1.5 senses. See Ted Pedersen's conversions.
Open Mind Word Expert sense-tagged data
Various data sets for English, Romanian, and Hindi
Rada Mihalcea's Senseval-2 and Senseval-3 conversions into SemCor format
Senseval-2 and Senseval-3 English all-words data converted into SemCor format
SemCor
Brown Corpus texts annotated with WordNet 1.6 senses and automatically mapped to WordNet 1.7, WordNet 1.7.1, WordNet 2.0, WordNet 2.1, and WordNet 3.0
SEMiSUSANNE
33 sense-tagged and structurally annotated documents from the Brown Corpus
Sensecorpus
Automatically extracted examples for all WordNet 1.6 noun senses and topic signatures built based on those examples
Senseval-2
Three all-words sense-annotated Penn Treebank II articles comprising in total some 5000 words of running text, plus some Penn Treebank II Wall Street Journal and British National Corpus text where 75 to 300 instances of a total of 73 nouns, adjectives, and verbs have been annotated with their WordNet 1.7 senses. See also Ted Pedersen's and Rada Mihalcea's conversions.
Ted Pedersen's Sense-tagged Text
Versions of the Senseval-1, Senseval-2, line, hard, serve, and interest data which have been converted to a common format (Senseval-2), POS tagged, and parsed.
TWA sense-tagged data
Sense-tagged data for six words with two-way ambiguities (bass, crane, motion, palm, plant, tank)
WordNet Gloss Disambiguation Project
A corpus of WordNet 3.0 glosses with word forms disambiguated to their WordNet 3.0 senses
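
Several of the corpora above are distributed in, or have been converted to, the Senseval-2 lexical-sample format, an XML-like layout in which each instance carries one or more answers and a context with the target word marked by a head element. The following sketch reads a hand-written snippet imitating that layout using only the Python standard library. The instance ID, sense label, and sentence are illustrative assumptions, and real distributions may need cleanup before a strict XML parser accepts them.

```python
import xml.etree.ElementTree as ET

# Hand-written snippet imitating the Senseval-2 lexical-sample layout.
# The ID, sense label, and sentence are hypothetical.
SAMPLE = """\
<corpus lang="english">
  <lexelt item="bass.n">
    <instance id="bass.n.001">
      <answer instance="bass.n.001" senseid="bass_fish"/>
      <context>He caught a large <head>bass</head> in the lake .</context>
    </instance>
  </lexelt>
</corpus>
"""


def read_instances(xml_text):
    """Yield one dict per instance with its lexical item, senses, and context."""
    root = ET.fromstring(xml_text)
    for lexelt in root.iter("lexelt"):
        for inst in lexelt.iter("instance"):
            context = inst.find("context")
            head = context.find("head")
            yield {
                "item": lexelt.get("item"),
                "id": inst.get("id"),
                # An instance may carry several answers, so collect them all.
                "senses": [a.get("senseid") for a in inst.findall("answer")],
                "head": head.text,
                # Flatten the context to plain text with normalised whitespace.
                "context": " ".join("".join(context.itertext()).split()),
            }


if __name__ == "__main__":
    for instance in read_instances(SAMPLE):
        print(instance)
```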

Software

CuiTools
A complete word sense disambiguation system that assigns senses to biomedical text based on the UMLS
DKPro WSD
A collection of software components for word sense disambiguation based on the Apache UIMA framework.
GWSD: Unsupervised Graph-based Word Sense Disambiguation
A system for unsupervised all-words graph-based word sense disambiguation
LingPipe
A Java natural language processing toolkit. A tutorial on using LingPipe for word sense disambiguation is available.
Natural Language Toolkit (NLTK)
Python modules for NLP, including a module for reading Senseval-2 data
SenseClusters
A package of (mostly) Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods.
SenseLearner
An all-words word sense disambiguation tool
SenseTools
A suite of tools that allow for easy creation of supervised word sense disambiguation systems
Senseval-2 data format converters
Tools to convert between the following formats: Senseval-1, Senseval-2, Senseval-2 with conflated words, Headless Senseval-2, WePS, English Giga Word, plain text, National Library of Medicine Test Collection, Open Mind Data
WordNet::SenseRelate
Perl tools which use measures of semantic similarity and relatedness to perform word sense disambiguation
WSD Gate
A word sense disambiguation toolkit using GATE and WEKA
WSD Shell
An improved version of the Duluth-Shell which was used as a driver for the Duluth Senseval-2 and Senseval-3 systems