Word sense disambiguation resources
Word sense disambiguation (WSD) is an open problem in natural language processing concerned with determining which sense (i.e., meaning) of a word is used in a particular context. This article provides links to important WSD-related publications, software, corpora, and other resources.
Introductory material, overviews, and surveys
- Word sense disambiguation (Wikipedia)
- Word sense disambiguation (Scholarpedia)
- Word sense disambiguation (ACLWiki)
- Eneko Agirre and Philip Edmonds, editors. Word Sense Disambiguation: Algorithms and Applications, volume 33 of Text, Speech, and Language Technology. Springer, 2006. ISBN 978-1-4020-6870-6.
- Advances in Word Sense Disambiguation tutorial by Rada Mihalcea and Ted Pedersen (2005)
- Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41:10:1–10:69, February 2009. ISSN 0360-0300.
- Nancy Ide and Jean Véronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40, 1998. ISSN 0891-2017.
- K. C. Litkowski. Computational lexicons and dictionaries. In Keith Brown, editor, Encyclopedia of Language and Linguistics, pages 753–761. Elsevier Science, Oxford, second edition, 2005. ISBN 978-0-08-044299-0.
- Philip Edmonds. Lexical disambiguation. In Keith Brown, editor, Encyclopedia of Language and Linguistics, pages 607–623. Elsevier Science, Oxford, second edition, 2005. ISBN 978-0-08-044299-0.
- David Jurafsky and James H. Martin. Speech and Language Processing, chapter Computational Lexical Semantics. Prentice Hall, second edition, 2008. ISBN 978-0131873216.
- Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing, chapter Word Sense Disambiguation, pages 229–264. The MIT Press, 1999. ISBN 978-0262133609.
- David Yarowsky. Word sense disambiguation. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing, pages 315–338. Chapman and Hall/CRC, second edition, 2010. ISBN 978-1420085921.
Conferences, workshops, and journals
- The International Committee on Computational Linguistics (ICCL) and its conferences:
- The Association for Computational Linguistics (ACL) and its associated organizations, conferences, workshops, and special interest groups:
- ACL SIGLEX, the umbrella organization for the Semeval and Senseval evaluation exercises:
- Senseval-1 (1998)
- Senseval-2 (2001)
- Senseval-3 (2004)
- Semeval-1 (2007)
- Semeval-2 (2010)
- Semeval-3 (2013)
- Robust WSD task at the Cross Language Evaluation Forum (CLEF)
- Computational Linguistics. MIT Press. ISSN 0891-2017.
- Natural Language Engineering. Cambridge University Press. ISSN 1351-3249.
Sense inventories and other lexical resources
- DANTE
- A lexical database for English
- GCIDE_XML
- The GNU version of the Collaborative International Dictionary of English (CIDE), presented in XML
- HECTOR
- A 35-word English dictionary used for Senseval-1
- Longman Dictionary of Contemporary English (LDOCE). Burnt Mill, Essex: Longman, 1978
- This proprietary dictionary saw considerable use by the WSD research community before less restrictively licensed resources became available.
- Roget's International Thesaurus. New York: Harper Collins, 1992
- This proprietary thesaurus saw considerable use by the WSD research community before less restrictively licensed resources became available.
- The Open Roget's Project
- A free implementation of the 1911 Roget's Thesaurus.
Wordnets and associated resources
- WordNet
- A lexical database for English
- Wordnets in the world
- A list of wordnets for various languages
- eXtended WordNet
- A version of WordNet in which the glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
- Inter-version WordNet mappings
- Mappings between synset offsets in various WordNet versions
- MCR
- An integration of five local wordnets, the EuroWordNet Top Concept ontology, MultiWordNet Domains, and hundreds of thousands of new semantic relations and properties automatically acquired from corpora.
Annotated corpora
- Alan Smeaton and Ian Quigley's image captions
- 8816 WordNet 1.5-annotated instances of 2304 lemmas in 2714 image captions
- DSO Corpus of Sense-Tagged English
- Sense-tagged word occurrences for 121 nouns and 70 verbs occurring in the Brown Corpus and Wall Street Journal corpus
- HECTOR (Senseval-1)
- Separate training and test corpora with 35 word types annotated with their HECTOR senses. See also Ted Pedersen's conversions.
- interest
- Wall Street Journal articles with 2369 instances of "interest" annotated with their LDOCE senses. See Ted Pedersen's conversions.
- line, hard, serve
- Wall Street Journal articles with over 12,000 instances of "line", "hard", and "serve" tagged with a subset of their WordNet 1.5 senses. See Ted Pedersen's conversions.
- Open Mind Word Expert sense-tagged data
- Various data sets for English, Romanian, and Hindi
- Rada Mihalcea's Senseval-2 and Senseval-3 conversions into SemCor format
- Senseval-2 and Senseval-3 English all-words data converted into SemCor format
- SemCor
- Brown Corpus texts annotated with WordNet 1.6 senses and automatically mapped to WordNet 1.7, 1.7.1, 2.0, 2.1, and 3.0
- SEMiSUSANNE
- 33 sense-tagged and structurally annotated documents from the Brown Corpus
- Sensecorpus
- Automatically extracted examples for all WordNet 1.6 noun senses and topic signatures built based on those examples
- Senseval-2
- Three all-words sense-annotated Penn Treebank II articles comprising in total some 5000 words of running text, plus some Penn Treebank II Wall Street Journal and British National Corpus text where 75 to 300 instances of a total of 73 nouns, adjectives, and verbs have been annotated with their WordNet 1.7 senses. See also Ted Pedersen's and Rada Mihalcea's conversions.
- Ted Pedersen's Sense-tagged Text
- Versions of the Senseval-1, Senseval-2, line, hard, serve, and interest data which have been converted to a common format (Senseval-2), POS tagged, and parsed.
- TWA sense-tagged data
- Sense-tagged data for six words with two-way ambiguities (bass, crane, motion, palm, plant, tank)
- WordNet Gloss Disambiguation Project
- A corpus of WordNet 3.0 glosses with word forms disambiguated to their WordNet 3.0 senses
Software
- CuiTools
- A complete word sense disambiguation system that assigns senses to biomedical text based on the UMLS
- DKPro WSD
- A collection of software components for word sense disambiguation based on the Apache UIMA framework.
- GWSD: Unsupervised Graph-based Word Sense Disambiguation
- A system for unsupervised all-words graph-based word sense disambiguation
- LingPipe
- A Java natural language processing toolkit. A tutorial on using LingPipe for word sense disambiguation is available.
- Natural Language Toolkit (NLTK)
- Python modules for NLP, including a module for reading Senseval-2 data
- SenseClusters
- A package of (mostly) Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods.
- SenseLearner
- An all-words word sense disambiguation tool
- SenseTools
- A suite of tools that allow for easy creation of supervised word sense disambiguation systems
- Senseval-2 data format converters
- Tools to convert between the following formats: Senseval-1, Senseval-2, Senseval-2 with conflated words, Headless Senseval-2, WePS, English Gigaword, plain text, National Library of Medicine Test Collection, Open Mind Data
- WordNet::SenseRelate
- Perl tools which use measures of semantic similarity and relatedness to perform word sense disambiguation
- WSD Gate
- A word sense disambiguation toolkit using GATE and WEKA
- WSD Shell
- An improved version of the Duluth-Shell which was used as a driver for the Duluth Senseval-2 and Senseval-3 systems
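Several of the tools and corpora listed above read or write the Senseval-2 lexical sample format. As a rough illustration of what that format looks like, the following minimal Python sketch parses a small hand-written sample using only the standard library. The sample text, the instance IDs, and the sense IDs (`bass%fish`, `bass%music`) are hypothetical and were made up for this example, and `read_instances` is an illustrative helper, not part of any listed package. The sketch also assumes well-formed XML, which may not hold for every distributed data file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample written in the style of the Senseval-2 lexical
# sample format; the instance IDs and sense IDs are invented.
SAMPLE = """<corpus lang="english">
<lexelt item="bass.n">
<instance id="bass.n.001">
<answer instance="bass.n.001" senseid="bass%fish"/>
<context>He caught a large <head>bass</head> in the lake .</context>
</instance>
<instance id="bass.n.002">
<answer instance="bass.n.002" senseid="bass%music"/>
<context>She plays <head>bass</head> in a jazz trio .</context>
</instance>
</lexelt>
</corpus>"""


def read_instances(xml_text):
    """Return one dict per <instance>, with its target word and sense tags."""
    root = ET.fromstring(xml_text)
    instances = []
    for lexelt in root.iter("lexelt"):
        for inst in lexelt.iter("instance"):
            context = inst.find("context")
            head = context.find("head")  # the word to be disambiguated
            # An instance may carry more than one <answer> element.
            senses = [a.get("senseid") for a in inst.findall("answer")]
            # Flatten the context to plain text and normalize whitespace.
            text = " ".join("".join(context.itertext()).split())
            instances.append({
                "lexelt": lexelt.get("item"),
                "id": inst.get("id"),
                "head": head.text,
                "senses": senses,
                "context": text,
            })
    return instances


for item in read_instances(SAMPLE):
    print(item["id"], item["head"], item["senses"])
```

Each returned dictionary pairs a target word and its context with the annotated sense labels, which is the form a supervised WSD system would typically train on.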