ACL Wiki - User contributions [en]

Resources for Guarani

2015-10-22T06:32:57Z

Kiwibird: Created page with "* https://code.google.com/p/hltdi-l3/source/browse/#hg%2Fdicts bilingual dictionaries es-gn, GPL3"

* https://code.google.com/p/hltdi-l3/source/browse/#hg%2Fdicts bilingual dictionaries es-gn, GPL3

Resources for Norwegian

2015-09-01T18:10:56Z

Kiwibird: /* Free software */

==Corpora==
===Free software===
* http://www.nb.no/sprakbanken/show?serial=sbr-36&lang=nn CC-BY EU corpora (Acquis Communautaire): translations in TMX format from English to Nynorsk (52,527 translation units) and Bokmål (733,081 translation units)
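
TMX files such as the ones listed above store each aligned segment pair as a <code>tu</code> (translation unit) element containing one <code>tuv</code> per language. The following is a minimal sketch, not the distributors' own tooling, of how such a file could be read with the Python standard library. The sample document, its placeholder sentences, and the function name <code>read_translation_units</code> are invented for illustration; the real corpus files may use different language codes or extra attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented two-unit sample; the sentences are placeholders, not corpus data.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>First English sentence.</seg></tuv>
      <tuv xml:lang="nn"><seg>First Nynorsk placeholder.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Second English sentence.</seg></tuv>
      <tuv xml:lang="nn"><seg>Second Nynorsk placeholder.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dict per <tu>, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


units = read_translation_units(SAMPLE_TMX)
print(len(units), "translation units")
```

For files with hundreds of thousands of units, an incremental parser such as <code>ET.iterparse</code> would likely be a better fit than loading the whole document at once.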

===Proprietary===
* [http://corpora.informatik.uni-leipzig.de/ Norwegian plain text and Co-occurrences at LCC] ("the corpora may be used for scientific purposes only and not passed on to third parties")

==Timeline Analysis==
* [http://wortschatz.uni-leipzig.de/wdtno/ Ord I Dag]

==Machine translation systems==

===Free software===

* [http://www.apertium.org Apertium] Norwegian Nynorsk<->Norwegian Bokmål, GPL v2
** [http://wiki.apertium.org/wiki/Apertium-nn-nb wiki] with installation information etc.

===Proprietary===

==Lexical resources==
===Free software===
* [http://svn.emmtee.net/tags/topp/parc/pargram/norwegian/bokmal/bokmal-nkllex.lfg Bokmål LFG lexicon] with POS and count/mass, GPL
* [http://www.edd.uio.no/prosjekt/ordbanken/ Norsk ordbank], full form dictionaries for Nynorsk (106,789 lemmata) and Bokmål (142,899 lemmata), GPL
** [http://savannah.nongnu.org/projects/ordbanken/ alternative download with cli lookup interface]
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar SCARRIE, Bokmål full form dictionary], XML, about 75,000 lemmata, CC-BY unported

===Unknown license===
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar "Leksikalsk database for norsk, opphavleg produsert av NST"], lexical database with SAMPA transcriptions, meant for speech technology

==Parsing/disambiguation==
===Free software===
* [http://www.hf.uio.no/tekstlab/tagger.html Oslo-Bergen-taggeren], [[Constraint Grammar]] disambiguator, GPL
** [https://github.com/noklesta/The-Oslo-Bergen-Tagger source and packages on github]
** [http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger older alternative download site]
** [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-nn-nb/ the version used in Apertium]
** [https://github.com/ogrim/clj-obt Clojure bindings]

* [http://www.hf.ntnu.no/hf/isk/Ansatte/petter.haugereid/norsyg.html Norsyg], [[HPSG]] grammar for Norwegian bokmål, LGPL. Implemented in [[LKB]], works with the full ''Norsk ordbank'' lexicon.

[[Category:Resources by language|Norwegian]]

Resources for Persian

2015-08-11T18:45:04Z

Kiwibird: /* Morphology tools */

== Corpora ==
===Free===
*[http://www.ling.ohio-state.edu/~jonsafari/corpora VOA Persian Corpus 2003-2008] (public domain)

===Proprietary===

*[http://ece.ut.ac.ir/DBRG/Bijankhan/ Bijankhan corpus] (gratis for research/non-commercial purposes)
*[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S50 CALLFRIEND Farsi (speech)], LDC
*[http://ece.ut.ac.ir/dbrg/hamshahri/ Hamshahri corpus] (gratis for research/non-commercial purposes)
*[http://www.elda.org/catalogue/en/speech/S0112.html Persian speech database Farsdat], ELRA

==Lexical resources==
===Free===
*[http://www.ling.ohio-state.edu/~jonsafari/corpora/wikipedia_fa-en_20120217.txt.xz Persian - English dictionary], derived from Wikipedia article names. Retains Wikipedia's CC-BY-SA 3.0 license.

===Proprietary===
*[http://pwn.ir Persian WordNet]

==Machine translation==
===Free===
*[http://ece.ut.ac.ir/node/100869?destination=node%2F100869 Tehran English-Persian Parallel Corpus] by Mohammad Taher Pilevar, NLP Lab, University of Tehran. For research or non-commercial use.

===Proprietary===
*[http://crl.nmsu.edu/Research/Projects/shiraz/index.html The Shiraz project] (Persian -> English)

==Morphology tools==
===Free===
*[http://sourceforge.net/projects/perstem Perstem] - Persian stemmer, light morphological analyzer, and character set converter.
*[http://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-tg-fa/apertium-tg-fa.fa.dix Morphological dictionary] — compiled using [[lttoolbox]].
*[http://stp.lingfil.uu.se/~mojgan/ BLARK by Mojgan Seraji] – normaliser, tokeniser, segmentation, hunpos model for PoS-tagging and (java) dependency parser, all GPL

==Parsing==
===Free===
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://www.ling.ohio-state.edu/~jonsafari/persianlg/ Persian dictionaries] for the [http://www.abisource.com/projects/link-grammar/ Link-Grammar parser]. By [http://www.ling.ohio-state.edu/~jonsafari/ Jon Dehdari]. These require the Perstem stemming package, above.
* [http://stp.lingfil.uu.se/~mojgan/UPDT.html Uppsala Persian Dependency Treebank], Creative Commons Attribution 3.0 License

===Proprietary===
*[http://dadegan.ir/en/persiandependencytreebank Dadegan Dependency Treebank] for research purposes only.
*[http://hpsg.fu-berlin.de/~ghayoomi/PTB.html HPSG Persian Treebank (PerTreeBank)] for academic research purposes only.

==Bibliography==
* Dehdari, Jon, and Deryle Lonsdale. 2008. [http://www.ling.ohio-state.edu/~jonsafari/papers/dehdari_lonsdale_2005.pdf A link grammar parser for Persian]. In Karimi, S., Samiian, V., and Stilo, D., editors, ''Aspects of Iranian Linguistics'', volume 1. Cambridge Scholars Press. ISBN: 978-18-471-8639-3 ([http://www.ling.ohio-state.edu/~jonsafari/bib/dehdarilonsdale2005.bib.txt BIB])

* Feili, H. and G. Ghassem-Sani (2004) "[http://sharif.edu/~sani/papers/Feili_SaniE2.pdf An Application of Lexicalized Grammars in English-Persian Translation]". ''Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004)'', 24-27 Aug. 2004, Universidad Politecnica de Valencia, Valencia, Spain, pp. 596-600.
* Megerdoomian, K. (2000) "[http://crl.nmsu.edu/Research/Projects/shiraz/publications/papers/Cicling.pdf Unification-Based Persian Morphology]". ''Proceedings of CICLing 2000'', Alexander Gelbukh, Center of Investigation on Computation-IPN, Mexico, 2000.
* Megerdoomian, K. (2004) "[http://acl.ldc.upenn.edu/coling2004/W5/pdf/W5-7.pdf Finite-State Morphological Analysis of Persian]". ''COLING 2004 Computational Approaches to Arabic Script-based Languages''. Ali Farghaly and Karine Megerdoomian editors, Geneva, Switzerland, 2004, pp. 35-41.
* Mohammad Amin Farajian (2011). [http://world-comp.org/p2011/ICA4953.pdf PEN: Parallel English-Persian News Corpus]. Proceedings of 2011 International Conference on Artificial Intelligence (ICAI'11), Nevada, USA.

==See also==
*[[Resources for Kurdish]]
*[[Resources for Tajik]]

==External links==
*[https://wiki.iranianlinguistics.org/wiki/Main_Page NLP Resources for Persian]
*[http://www.ling.ohio-state.edu/~jonsafari/persian_nlp.html the Jon safari] (link parser, small lexicon, stemmer, morphological analysis tools)

[[Category:Resources by language|Persian]]

Resources for Tatar

2015-06-19T10:37:54Z

Kiwibird: /* Corpora */

==Morphology==

===Free software===
* http://wiki.apertium.org/wiki/Apertium-tat GPL analyser and disambiguator

===Proprietary===

==Machine translation==

===Free software===
* http://wiki.apertium.org/wiki/Kazakh_and_Tatar rule-based, GPL license
* http://wiki.apertium.org/wiki/Tatar_and_Russian rule-based, GPL license

===Proprietary===
* http://tatar.com.ru/trans.php

==Corpora==
===Proprietary===
* http://corpus.tatar/index_en.php (so far only freely searchable for non-commercial scientific or educational use)

===Free/open licence===

* http://tatoeba.org/deu/sentences/show_all_in/tat/none/none/indifferent Tatoeba sentences in Tatar

==Bibliography==

==External links==

[[Category:Resources by language|Tatar]]

Resources for Tatar

2015-06-19T10:37:29Z

Kiwibird: Created page with "==Morphology== ===Free software=== * http://wiki.apertium.org/wiki/Apertium-tat GPL analyser and disambiguator ===Proprietary=== ==Machine translation== ===Free software==..."

==Morphology==

===Free software===
* http://wiki.apertium.org/wiki/Apertium-tat GPL analyser and disambiguator

===Proprietary===

==Machine translation==

===Free software===
* http://wiki.apertium.org/wiki/Kazakh_and_Tatar rule-based, GPL license
* http://wiki.apertium.org/wiki/Tatar_and_Russian rule-based, GPL license

===Proprietary===
* http://tatar.com.ru/trans.php

==Corpora==
===Proprietary===
* http://corpus.tatar/index_en.php (so far only freely searchable for non-commercial scientific or educational use)

===Free/open licence===

==Bibliography==

==External links==

[[Category:Resources by language|Tatar]]

Resources for Arabic

2015-04-20T09:21:26Z

Kiwibird: /* Morphology */

==Morphology==

===Free software===
*[https://sourceforge.net/projects/aramorph/ AraMorph - Perl] - An Arabic morphological analyzer and part-of-speech tagger written in Perl (originally by Tim Buckwalter)
*[http://www.nongnu.org/aramorph/ AraMorph - Java] - An Arabic morphological analyzer and part-of-speech tagger rewritten in Java for [http://lucene.apache.org/ Lucene]

===Proprietary===
*[http://www.arabic-morphology.com Xerox Arabic Morphological Analyzer and Generator]

==WordNets==

===Free software===
* http://compling.hss.ntu.edu.sg/omw/ Arabic Wordnet with links to all the other Open Multilingual Wordnets

===Proprietary===
* http://babelnet.org/ (available for download for "Non-Commercial" use)

==Parsers==
===Free software===
* [http://www.cis.upenn.edu/~dbikel/software.html#stat-parser Bikel's implementation of Collins Parser] by [http://www.cis.upenn.edu/~dbikel/ Dan Bikel].
* [http://www.ling.ohio-state.edu/~jonsafari/arabiclg/arabiclg.20060829.tar.bz2 Arabic dictionaries], by [http://www.ling.ohio-state.edu/~jonsafari/ Jon Dehdari], for the [http://www.abisource.com/projects/link-grammar/ Link-Grammar parser]. These require the Aramorph stemming package, above.
* [https://sourceforge.net/apps/trac/elixir-fm/wiki ElixirFM] ([http://quest.ms.mff.cuni.cz/cgi-bin/elixir/index.fcgi online interface here]) is a Functional Arabic Morphology written in Haskell and Perl; the lexicon is a "re-processed" version of the Buckwalter analyser.
* [http://sourceforge.net/projects/sarf Sarf] - Arabic Morphology System (all in Java)

==Corpora==
===Proprietary===
*[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1], 76 million tokens, annotation: paragraphs

===Free/open licence===
* [http://github.com/anastaw/Meedan-Memory Meedan-Memory], Arabic-English TMX (sentence-aligned), ~467,000 words on the English side, [http://www.opendatacommons.org/licenses/odbl/ Open Database Licence]
* [http://quran.uk.net/ Quranic Arabic Corpus], 77,430 words of Quranic Arabic, with manually verified contextual POS, inflection, derivation; [[dependency grammar]] annotation is planned.
* [http://www1.ccls.columbia.edu/~ybenajiba/downloads.html Arabic NER corpora] by [http://www1.ccls.columbia.edu/~ybenajiba/ Yassine Benajiba], 150,000+ words.
* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.

==Bibliography==

==External links==
*[http://www.elsnet.org/acl2001-arabic.html ACL/EACL 2001 Workshop on Arabic NLP]
*[http://www1.cs.columbia.edu/~mdiab/software/ASVMTools_2.0.tar.gz Basic Arabic Processing Tools]
*[http://acl.ldc.upenn.edu/coling2004/W5/index.html COLING 2004 Workshop on computational approaches to Arabic script-based languages]

[[Category:Resources by language|Arabic]]

Resources for Hebrew

2015-04-20T09:20:02Z

Kiwibird:

* [http://www.ivrix.org.il/projects/spell-checker/ Hebrew Spellchecker]
* [http://search.cpan.org/dist/Lingua-HE-Sentence/ Lingua-HE-Sentence-0.13] - Perl module for splitting Hebrew text into sentences
* http://compling.hss.ntu.edu.sg/omw/ Hebrew Wordnet with links to all the other Open Multilingual Wordnets

[[Category:Resources by language|Hebrew]]

Resources for Hebrew

2015-04-20T09:19:48Z

Kiwibird:

* [http://www.ivrix.org.il/projects/spell-checker/ Hebrew Spellchecker]
* [http://search.cpan.org/dist/Lingua-HE-Sentence/ Lingua-HE-Sentence-0.13] - Perl module for splitting Hebrew text into sentences
* http://compling.hss.ntu.edu.sg/omw/ Hebrew Wordnet with links to all the other Open Wordnets

[[Category:Resources by language|Hebrew]]

Resources for Thai

2014-04-10T11:51:40Z

Kiwibird: Created page with " https://github.com/veer66/Yaitron Yaitron English-Thai and Thai-English XML dictionary, license seems standard 4-clause"

https://github.com/veer66/Yaitron Yaitron English-Thai and Thai-English XML dictionary, license seems standard 4-clause

List of resources by language

2014-04-10T11:51:16Z

Kiwibird: /* T */

List of pages which give links and commentary on computational resources by language.

Quick links:

* [[Resources for English]]
* [[Multilingual resources|Resources for Multilingual Applications]]

See also:

* [http://www.ethnologue.com/ Ethnologue: Languages of the World]
* [[Language Identification Tools]]

==A==
__NOTOC__
{{compactTOC2}}
* [[Resources for Albanian]]
* [[Resources for Amharic]]
* [[Resources for Arabic]]
* [[Resources for Afrikaans]]

==B==
__NOTOC__
{{compactTOC2}}
* [[Resources for Basque]]
* [[Resources for Bulgarian]]
* [[Resources for Breton]]

==C==
__NOTOC__
{{compactTOC2}}
* [[Resources for Catalan]]
* [[Resources for Chinese]]
* [[Resources for Croatian]] (see also [[Resources for Serbian]], [[Resources for Bosnian]], [[Resources for Serbo-Croatian]])
* [[Resources for Czech]]

==D==
__NOTOC__
{{compactTOC2}}
* [[Resources for Danish]]
* [[Resources for Dutch]]

==E==
__NOTOC__
{{compactTOC2}}
* [[Resources for English]]
* [[Resources for Esperanto]]
* [[Resources for Estonian]]

==F==
__NOTOC__
{{compactTOC2}}
* [[Resources for Faroese]]
* [[Resources for Finnish]]
* [[Resources for French]]

==G==
__NOTOC__
{{compactTOC2}}
* [[Resources for Galician]]
* [[Resources for Georgian]]
* [[Resources for German]]
* [[Resources for Greek]]
* [[Resources for Greenlandic]]

==H==
__NOTOC__
{{compactTOC2}}
* [[Resources for Haitian]]
* [[Resources for Hebrew]]
* [[Resources for Hindi]]
* [[Resources for Hungarian]]

==I==
__NOTOC__
{{compactTOC2}}
* [[Resources for Icelandic]]
* [[Resources for Indonesian]]
* [[Resources for Inuktitut]]
* [[Resources for Iñupiaq]]
* [[Resources for Iranian]]
* [[Resources for Italian]]
* [[Resources for Irish]]

==J==
__NOTOC__
{{compactTOC2}}
* [[Resources for Japanese]]

==K==
__NOTOC__
{{compactTOC2}}
* [[Resources for Kannada]]
* [[Resources for Korean]]
* [[Resources for Komi]]
* [[Resources for Kurdish]]

==L==
__NOTOC__
{{compactTOC2}}
* [[Resources for Lithuanian]]

==M==
__NOTOC__
{{compactTOC2}}
* [[Resources for Macedonian]]
* [[Resources for Malay]]
* [[Resources for Maltese]]
* [[Resources for Montenegrin]]
* [[Multilingual resources|Resources for Multilingual Applications]]

==N==
__NOTOC__
{{compactTOC2}}
* [[Resources for Norwegian]]
* [[Resources for Navajo]]

==O==
__NOTOC__
{{compactTOC2}}
* [[Resources for Occitan]]

==P==
__NOTOC__
{{compactTOC2}}
* [[Resources for Pashto]]
* [[Resources for Persian]]
* [[Resources for Polish]]
* [[Resources for Portugese|Resources for Portuguese]]
* [[Resources for Punjabi]]

==Q==
__NOTOC__
{{compactTOC2}}
* [[Resources for Quechua]]

==R==
__NOTOC__
{{compactTOC2}}
* [[Resources for Romanian]]
* [[Resources for Russian]]

==S==
__NOTOC__
{{compactTOC2}}
* [[Resources for Sámi]]
* [[Resources for Sanskrit]]
* [[Resources for Slovak]]
* [[Resources for Slovenian]]
* [[Resources for Sorbian]]
* [[Resources for Spanish]]
* [[Resources for Swahili]]
* [[Resources for Swedish]]

==T==
__NOTOC__
{{compactTOC2}}
* [[Resources for Tajik]]
* [[Resources for Turkish]]
* [[Resources for Tigrinya]]
* [[Resources for Telugu]]
* [[Resources for Thai]]

==U==
__NOTOC__
{{compactTOC2}}
* [[Resources for Ukrainian]]
* [[Resources for Urdu]]

==V==
__NOTOC__
{{compactTOC2}}
* [[Resources for Vietnamese]]

==W==
__NOTOC__
{{compactTOC2}}
* [[Resources for Welsh]]

==Z==
__NOTOC__
{{compactTOC2}}
* [[Resources for Zulu]]

==See also==

* [[Resources for African languages]]

[[Category:Resources by language|*]]

Part-of-speech tagging

2012-12-29T17:25:59Z

Kiwibird: /* Software */

'''Part-of-speech tagging''' is the task of assigning a part-of-speech tag to each word in a given text.
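
To make the task concrete, here is a minimal sketch of a most-frequent-tag baseline: each known word receives the tag it carried most often in training, and unknown words receive the overall most frequent tag. This is an illustrative baseline only, not the method of any tagger listed on this page; the toy corpus, the Penn-style tags in it, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn a word -> most frequent tag lexicon and a default tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    lexicon = {word: counts.most_common(1)[0][0]
               for word, counts in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Assign one tag to every word; unknown words get the default tag."""
    return [(word, lexicon.get(word, default_tag)) for word in words]


# Invented toy training data (NN is the most frequent tag overall).
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")],
    [("the", "DT"), ("cat", "NN"), ("food", "NN"), ("arrived", "VBD"), (".", ".")],
]

lexicon, default_tag = train_baseline(toy_corpus)
tagged = tag(["the", "bird", "barks", "."], lexicon, default_tag)
print(tagged)
```

Here "bird" never occurs in the toy corpus, so it falls back to the default tag. The systems listed under Software replace this lookup with models that also use the surrounding context.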

==History==

==Further reading==

==Software==
*[http://sourceforge.net/projects/acopost ACOPOST] - a collection of taggers using maximum entropy, second order Markov, exemplar, and transformation-based models. See also [http://hermes.sourceforge.net/acopost.html this site]. Free, open source license.
*[http://danieldk.org/Code/Citar Citar] - uses [http://en.wikipedia.org/wiki/Trigram trigram]-based [http://en.wikipedia.org/wiki/Hidden_Markov_model HMM]s. Free, open source license.
*[http://crfpp.sourceforge.net CRF++] - uses [http://en.wikipedia.org/wiki/Conditional_random_field Conditional random fields]. Free, open source license (dual: LGPL, New BSD). C++.
*[http://crftagger.sourceforge.net CRFTagger] - for English. Free, open source license. Java.
*[http://sourceforge.net/projects/gposttl GPoSTTL] - Enhanced [http://en.wikipedia.org/wiki/Brill_tagger TBL] tagger for English. Open source license.
*[http://code.google.com/p/hunpos/ HunPos] - uses trigram-based HMMs. Free, open source license. OCaml.
*[http://cogcomp.cs.illinois.edu/page/software_view/3 Illinois LBJ POS Tagger] - uses an averaged [http://en.wikipedia.org/wiki/Perceptron Perceptron]-based sequential model. Free, open source license. Java API.
*[http://ilk.uvt.nl/mbt Memory-based tagger] (MBT) - uses [http://ilk.uvt.nl/timbl TiMBL]. Free, open source license.
*[http://ufal.mff.cuni.cz/morce/index.php Morče] - uses an averaged [http://en.wikipedia.org/wiki/Perceptron Perceptron]-based model. Free, open source license (GPL2).
*[http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger] - uses [http://en.wikipedia.org/wiki/Logistic_regression Maximum entropy models]. Free, open source license. Java.
*[http://www.lsi.upc.es/~nlp/SVMTool SVMTool] - uses [http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines]. Free, open source license, but depends on non-Free/open source [http://svmlight.joachims.org/ SVMlight].

==See also==
*[[POS Tagging (State of the art)]]

==External links==
*[http://en.wikipedia.org/wiki/Part-of-speech_tagging Wikipedia article on POS tagging]

[[Category:Morphology]]
[[Category:Syntax]]
[[Category:Software]]

Part-of-speech tagging

2012-12-29T17:22:43Z

Kiwibird: /* Software */

'''Part-of-speech tagging''' is the task of assigning a part-of-speech tag to each word in a given text.

==History==

==Further reading==

==Software==
*[http://sourceforge.net/projects/acopost ACOPOST] - a collection of taggers using maximum entropy, second order Markov, exemplar, and transformation-based models. See also [http://hermes.sourceforge.net/acopost.html this site]. Free, open source license.
*[http://danieldk.org/Code/Citar Citar] - uses [http://en.wikipedia.org/wiki/Trigram trigram]-based [http://en.wikipedia.org/wiki/Hidden_Markov_model HMM]s. Free, open source license.
*[http://crfpp.sourceforge.net CRF++] - uses [http://en.wikipedia.org/wiki/Conditional_random_field Conditional random fields]. Free, open source license (dual: LGPL, New BSD). C++.
*[http://crftagger.sourceforge.net CRFTagger] - for English. Free, open source license. Java.
*[http://sourceforge.net/projects/gposttl GPoSTTL] - Enhanced [http://en.wikipedia.org/wiki/Brill_tagger TBL] tagger for English. Open source license.
*[http://code.google.com/p/hunpos/ HunPos] - uses trigram-based HMMs. Free, open source license. OCaml.
*[http://cogcomp.cs.illinois.edu/page/software_view/3 Illinois LBJ POS Tagger] - uses an averaged [http://en.wikipedia.org/wiki/Perceptron Perceptron]-based sequential model. Free, open source license. Java API.
*[http://ilk.uvt.nl/mbt Memory-based tagger] (MBT) - uses [http://ilk.uvt.nl/timbl TiMBL]. Free, open source license.
*[http://ufal.mff.cuni.cz/morce/index.php Morče] - uses an averaged [http://en.wikipedia.org/wiki/Perceptron Perceptron]-based model. Free, open source license (GPL2).
*[http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger] - uses [http://en.wikipedia.org/wiki/Logistic_regression Maximum entropy models]. Free, open source license.
*[http://www.lsi.upc.es/~nlp/SVMTool SVMTool] - uses [http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines]. Free, open source license, but depends on non-Free/open source [http://svmlight.joachims.org/ SVMlight].

==See also==
*[[POS Tagging (State of the art)]]

==External links==
*[http://en.wikipedia.org/wiki/Part-of-speech_tagging Wikipedia article on POS tagging]

[[Category:Morphology]]
[[Category:Syntax]]
[[Category:Software]]

Part-of-speech tagging

2012-12-29T17:21:58Z

Kiwibird: /* Software */

'''Part-of-speech tagging''' is the task of assigning a part-of-speech tag to each word in a given text.

==History==

==Further reading==

==Software==
*[http://sourceforge.net/projects/acopost ACOPOST] - a collection of taggers using maximum entropy, second order Markov, exemplar, and transformation-based models. See also [http://hermes.sourceforge.net/acopost.html this site]. Free, open source license.
*[http://danieldk.org/Code/Citar Citar] - uses [http://en.wikipedia.org/wiki/Trigram trigram]-based [http://en.wikipedia.org/wiki/Hidden_Markov_model HMM]s. Free, open source license.
*[http://crfpp.sourceforge.net CRF++] - uses [http://en.wikipedia.org/wiki/Conditional_random_field Conditional random fields]. Free, open source license. C++.
*[http://crftagger.sourceforge.net CRFTagger] - for English. Free, open source license. Java.
*[http://sourceforge.net/projects/gposttl GPoSTTL] - Enhanced [http://en.wikipedia.org/wiki/Brill_tagger TBL] tagger for English. Open source license.
*[http://code.google.com/p/hunpos/ HunPos] - uses trigram-based HMMs. Free, open source license. OCaml.
*[http://cogcomp.cs.illinois.edu/page/software_view/3 Illinois LBJ POS Tagger] - uses an averaged [http://en.wikipedia.org/wiki/Perceptron Perceptron]-based sequential model. Free, open source license. Java API.
*[http://ilk.uvt.nl/mbt Memory-based tagger] (MBT) - uses [http://ilk.uvt.nl/timbl TiMBL]. Free, open source license.
*[http://ufal.mff.cuni.cz/morce/index.php Morče] - uses an averaged [http://en.wikipedia.org/wiki/Perceptron Perceptron]-based model. Free, open source license (GPL2).
*[http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger] - uses [http://en.wikipedia.org/wiki/Logistic_regression Maximum entropy models]. Free, open source license.
*[http://www.lsi.upc.es/~nlp/SVMTool SVMTool] - uses [http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines]. Free, open source license, but depends on non-Free/open source [http://svmlight.joachims.org/ SVMlight].

==See also==
*[[POS Tagging (State of the art)]]

==External links==
*[http://en.wikipedia.org/wiki/Part-of-speech_tagging Wikipedia article on POS tagging]

[[Category:Morphology]]
[[Category:Syntax]]
[[Category:Software]]

POS Tagging (State of the art)

2012-12-29T17:12:57Z

Kiwibird: /* FTB */

==Test collections==
* '''Performance measure:''' per-token accuracy. (By convention, this is measured on all tokens, including punctuation tokens and other unambiguous tokens.)
* '''English'''
** '''Penn Treebank''' ''Wall Street Journal'' (WSJ) release 3 (LDC99T42). The splits of data for this task were not standardized early on (unlike for parsing) and early work uses various data splits defined by counts of tokens or by sections. Most work from 2002 on adopts the following data splits, introduced by Collins (2002):
*** '''Training data:''' sections 0-18
*** '''Development test data:''' sections 19-21
*** '''Testing data:''' sections 22-24

* '''French'''
** '''French TreeBank''' (FTB, Abeillé et al., 2003) ''Le Monde'', December 2007 version, 28-tag tagset (CC tagset, Crabbé and Candito, 2008). Classical data split (10-10-80):
*** '''Training data:''' sentences 2471 to 12351
*** '''Development test data:''' sentences 1236 to 2470
*** '''Testing data:''' sentences 1 to 1235
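
The two figures reported in the result tables, accuracy on all tokens and accuracy on unknown words, can be computed as sketched below. This is a minimal illustration, assuming that "unknown" means a token whose word form does not occur in the training data; individual papers may define it slightly differently. The function names and the toy example are invented.

```python
def per_token_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens (punctuation included) whose predicted tag is correct."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def unknown_word_accuracy(words, gold_tags, predicted_tags, training_vocabulary):
    """Accuracy restricted to tokens absent from the training vocabulary.

    Returns None when the test data contains no unknown words.
    """
    pairs = [(g, p)
             for w, g, p in zip(words, gold_tags, predicted_tags)
             if w not in training_vocabulary]
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)


# Invented toy example: one of four tokens is mistagged, and it is the
# only token missing from the training vocabulary.
words = ["The", "dog", "barked", "."]
gold = ["DT", "NN", "VBD", "."]
predicted = ["DT", "NN", "VBN", "."]
vocabulary = {"The", "dog", "."}

overall = per_token_accuracy(gold, predicted)
unknown = unknown_word_accuracy(words, gold, predicted, vocabulary)
print(overall, unknown)
```

Note that the punctuation token counts toward the overall figure, in line with the convention stated at the top of this section.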

== Tables of results ==

===WSJ===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| TnT*
| Hidden Markov model
| Brants (2000)
| [http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT]
| No
| 96.46%
| 85.86%
| Unknown
|-
| MElt
| MEMM with external lexical information
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 96.96%
| 91.29%
| CeCILL-C
|-
| GENiA Tagger**
| Maximum entropy cyclic dependency network
| Tsuruoka, et al (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/ GENiA]
| No
| 97.05%
| Not available
| Gratis for non-commercial usage
|-
| Averaged Perceptron
| Averaged Perceptron discriminative sequence model
| Collins (2002)
| Not available
| No
| 97.11%
| Not available
| Unknown
|-
| Maxent easiest-first
| Maximum entropy bidirectional easiest-first inference
| Tsuruoka and Tsujii (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/ Easiest-first]
| No
| 97.15%
| Not available
| Unknown
|-
| SVMTool
| SVM-based tagger and tagger generator
| Giménez and Márquez (2004)
| [http://www.lsi.upc.es/~nlp/SVMTool/ SVMTool]
| No
| 97.16%
| 89.01%
| LGPL 2.1
|-
| LAPOS
| Perceptron based training with lookahead
| Tsuruoka, Miyao, and Kazama (2011)
| [http://www.logos.t.u-tokyo.ac.jp/~tsuruoka/lapos/ LAPOS]
| No
| 97.22%
| Not available
| MIT
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| No
| 97.23%
| Not available
| Non-free ([http://ufal.mff.cuni.cz/compost/register.php academic-only])
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| Yes
| 97.44%
| Not available
| Unknown
|-
| Stanford Tagger 1.0
| Maximum entropy cyclic dependency network
| Toutanova et al. (2003)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.24%
| 89.04%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.29%
| 89.70%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| Yes
| 97.32%
| 90.79%
| GPL v2+
|-
| LTAG-spinal
| Bidirectional perceptron learning
| Shen et al. (2007)
| [http://www.cis.upenn.edu/~xtag/spinal/ LTAG-spinal]
| No
| 97.33%
| Not available
| Unknown
|-
| SCNN
| Semi-supervised condensed nearest neighbor
| Søgaard (2011)
| [http://cst.dk/anders/scnn/ SCNN]
| Yes
| 97.50%
| Not available
| Unknown
|}

(*) TnT: Accuracy is as reported by Giménez and Márquez (2004) for the given test collection. Brants (2000) reports 96.7% token accuracy and 85.5% unknown word accuracy on a 10-fold cross-validation of the Penn WSJ corpus.

(**) GENiA: Results are for models trained and tested on the given corpora (to be comparable to other results). The distributed GENiA tagger is trained on a mixed training corpus and gets 96.94% on WSJ, and 98.26% on GENiA biomedical English.

(***) Extra data: Whether system training exploited (usually large amounts of) extra unlabeled text, such as by semi-supervised learning, self-training, or using distributional similarity features, beyond the standard supervised training data.

===FTB===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| Morfette
| Perceptron with external lexical information*
| Chrupała et al. (2008), Seddah et al. (2010)
| [http://sites.google.com/site/morfetteweb/ Morfette]
| No
| 97.68%
| 90.52%
| New BSD
|-
| SEM
| CRF with external lexical information*
| Constant et al. (2011)
| [http://www.univ-orleans.fr/lifo/Members/Isabelle.Tellier/SEM.html SEM]
| No
| 97.7%
| Not available
| "GNU"(?)
|-
| MElt
| MEMM with external lexical information*
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 97.80%
| 91.77%
| CeCILL-C
|}

(*) External lexical information from the Lefff lexicon (Sagot 2010, [https://gforge.inria.fr/frs/?group_id=482 Alexina project])

== References ==

* Brants, Thorsten. 2000. [http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf TnT -- A Statistical Part-of-Speech Tagger]. ''6th Applied Natural Language Processing Conference''.

* Chrupała, Grzegorz, Dinu, Georgiana and van Genabith, Josef. 2008. [http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf Learning Morphology with Morfette]. ''LREC 2008''.

* Collins, Michael. 2002. [http://people.csail.mit.edu/mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms]. ''EMNLP 2002''.

* Constant, Matthieu, Tellier, Isabelle, Duchier, Denys, Dupont, Yoann, Sigogne, Anthony, and Billot, Sylvie. 2011. [http://www.lirmm.fr/~lopez/TALN2011/Longs-TALN+RECITAL/Tellier_taln11_submission_54.pdf Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français]. ''TALN'11''.

* Denis, Pascal and Sagot, Benoît. 2009. [http://alpage.inria.fr/~sagot/pub/paclic09tagging.pdf Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort]. ''PACLIC 2009''.

* Giménez, J., and Márquez, L. 2004. [http://www.lsi.upc.es/~nlp/SVMTool/lrec2004-gm.pdf SVMTool: A general POS tagger generator based on Support Vector Machines]. ''Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)''. Lisbon, Portugal.

* Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171--189. Springer.

* Seddah, Djamé, Chrupała, Grzegorz, Çetinoglu, Özlem and Candito, Marie. 2010. [http://aclweb.org/anthology-new/W/W10/W10-1410.pdf Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French]. ''SPMRL 2010 (NAACL 2010 workshop)''.

* Shen, L., Satta, G., and Joshi, A. 2007. [http://acl.ldc.upenn.edu/P/P07/P07-1096.pdf Guided learning for bidirectional sequence classification]. ''Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007)'', pages 760-767.

* Søgaard, Anders. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon.

* Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. ''Proceedings of the 12th EACL'', pages 763-771.

* Toutanova, K., Klein, D., Manning, C.D., and Singer, Y. 2003. [http://nlp.stanford.edu/kristina/papers/tagging.pdf Feature-rich part-of-speech tagging with a cyclic dependency network]. ''Proceedings of HLT-NAACL 2003'', pages 252-259.

* Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/pci05.pdf Developing a Robust Part-of-Speech Tagger for Biomedical Text]", ''Advances in Informatics - 10th Panhellenic Conference on Informatics'', '''LNCS 3746''', pp. 382-392.

* Tsuruoka, Yoshimasa, Yusuke Miyao, and Jun’ichi Kazama. 2011. "[http://aclweb.org/anthology-new/W/W11/W11-0328.pdf Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models?]" ''Proceedings of the Fifteenth Conference on Computational Natural Language Learning'', pp 238–246, 2011.

* Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data]", ''Proceedings of HLT/EMNLP 2005'', pp. 467-474.

== See also ==
* [[POS Induction (State of the art)]]
* [[Part-of-speech tagging]]
* [[State of the art]]

[[Category:State of the art]]

POS Tagging (State of the art)

2012-12-29T17:10:22Z

Kiwibird: sort

==Test collections==
* '''Performance measure:''' per-token accuracy. (By convention, this is measured on all tokens, including punctuation tokens and other unambiguous tokens.)
* '''English'''
** '''Penn Treebank''' ''Wall Street Journal'' (WSJ) release 3 (LDC99T42). The splits of data for this task were not standardized early on (unlike for parsing) and early work uses various data splits defined by counts of tokens or by sections. Most work from 2002 on adopts the following data splits, introduced by Collins (2002):
*** '''Training data:''' sections 0-18
*** '''Development test data:''' sections 19-21
*** '''Testing data:''' sections 22-24

* '''French'''
** '''French TreeBank''' (FTB; Abeillé et al., 2003) ''Le Monde'', December 2007 version, 28-tag tagset (CC tagset; Crabbé and Candito, 2008). Classical data split (10% test, 10% development, 80% training):
*** '''Training data:''' sentences 2471 to 12351
*** '''Development test data:''' sentences 1236 to 2470
*** '''Testing data:''' sentences 1 to 1235
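
The splits and the performance measure above can be sketched in a few lines of Python. This is only an illustration, not code from any of the listed systems: the function names <code>wsj_split</code>, <code>ftb_split</code> and <code>token_accuracy</code> are hypothetical, and "unknown word" is assumed here to mean a token whose word form does not occur in the training data.

```python
from typing import Optional, Sequence, Set, Tuple


def wsj_split(section: int) -> str:
    """Map a WSJ section number to the Collins (2002) split listed above."""
    if 0 <= section <= 18:
        return "train"
    if 19 <= section <= 21:
        return "dev"
    if 22 <= section <= 24:
        return "test"
    raise ValueError(f"WSJ section out of range: {section}")


def ftb_split(sentence_number: int) -> str:
    """Map a 1-based FTB sentence number to the classical split listed above."""
    if 1 <= sentence_number <= 1235:
        return "test"
    if 1236 <= sentence_number <= 2470:
        return "dev"
    if 2471 <= sentence_number <= 12351:
        return "train"
    raise ValueError(f"FTB sentence number out of range: {sentence_number}")


def token_accuracy(
    gold: Sequence[Tuple[str, str]],
    predicted: Sequence[str],
    known_words: Optional[Set[str]] = None,
) -> Tuple[float, Optional[float]]:
    """Return (all-token accuracy, unknown-word accuracy).

    gold is a sequence of (word, tag) pairs and predicted the system tags.
    Every token counts, including punctuation. The second value is None
    when no vocabulary is given or no unknown word occurs.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = unknown_total = unknown_correct = 0
    for (word, gold_tag), predicted_tag in zip(gold, predicted):
        hit = gold_tag == predicted_tag
        correct += hit
        # Assumption: a word is "unknown" if absent from the training vocabulary.
        if known_words is not None and word not in known_words:
            unknown_total += 1
            unknown_correct += hit
    all_accuracy = correct / len(gold)
    unknown_accuracy = unknown_correct / unknown_total if unknown_total else None
    return all_accuracy, unknown_accuracy
```

Multiplying the returned fractions by 100 gives percentages in the same form as the "All tokens" and "Unknown words" columns in the tables below.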

== Tables of results ==

===WSJ===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| TnT*
| Hidden Markov model
| Brants (2000)
| [http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT]
| No
| 96.46%
| 85.86%
| Unknown
|-
| MElt
| MEMM with external lexical information
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 96.96%
| 91.29%
| CeCILL-C
|-
| GENiA Tagger**
| Maximum entropy cyclic dependency network
| Tsuruoka et al. (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/ GENiA]
| No
| 97.05%
| Not available
| Gratis for non-commercial usage
|-
| Averaged Perceptron
| Averaged perceptron discriminative sequence model
| Collins (2002)
| Not available
| No
| 97.11%
| Not available
| Unknown
|-
| Maxent easiest-first
| Maximum entropy bidirectional easiest-first inference
| Tsuruoka and Tsujii (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/ Easiest-first]
| No
| 97.15%
| Not available
| Unknown
|-
| SVMTool
| SVM-based tagger and tagger generator
| Giménez and Márquez (2004)
| [http://www.lsi.upc.es/~nlp/SVMTool/ SVMTool]
| No
| 97.16%
| 89.01%
| LGPL 2.1
|-
| LAPOS
| Perceptron-based training with lookahead
| Tsuruoka, Miyao, and Kazama (2011)
| [http://www.logos.t.u-tokyo.ac.jp/~tsuruoka/lapos/ LAPOS]
| No
| 97.22%
| Not available
| MIT
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| No
| 97.23%
| Not available
| Non-free ([http://ufal.mff.cuni.cz/compost/register.php academic-only])
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| Yes
| 97.44%
| Not available
| Unknown
|-
| Stanford Tagger 1.0
| Maximum entropy cyclic dependency network
| Toutanova et al. (2003)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.24%
| 89.04%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.29%
| 89.70%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| Yes
| 97.32%
| 90.79%
| GPL v2+
|-
| LTAG-spinal
| Bidirectional perceptron learning
| Shen et al. (2007)
| [http://www.cis.upenn.edu/~xtag/spinal/ LTAG-spinal]
| No
| 97.33%
| Not available
| Unknown
|-
| SCNN
| Semi-supervised condensed nearest neighbor
| Søgaard (2011)
| [http://cst.dk/anders/scnn/ SCNN]
| Yes
| 97.50%
| Not available
| Unknown
|}

(*) TnT: Accuracy is as reported by Giménez and Márquez (2004) for the given test collection. Brants (2000) reports 96.7% token accuracy and 85.5% unknown word accuracy on a 10-fold cross-validation of the Penn WSJ corpus.

(**) GENiA: Results are for models trained and tested on the given corpora (to be comparable to other results). The distributed GENiA tagger is trained on a mixed training corpus and gets 96.94% on WSJ, and 98.26% on GENiA biomedical English.

(***) Extra data: Whether system training exploited (usually large amounts of) extra unlabeled text, such as by semi-supervised learning, self-training, or using distributional similarity features, beyond the standard supervised training data.

===FTB===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
|-
| Morfette
| Perceptron with external lexical information*
| Chrupała et al. (2008), Seddah et al. (2010)
| [http://sites.google.com/site/morfetteweb/ Morfette]
| No
| 97.68%
| 90.52%
|-
| SEM
| CRF with external lexical information*
| Constant et al. (2011)
| [http://www.univ-orleans.fr/lifo/Members/Isabelle.Tellier/SEM.html SEM]
| No
| 97.7%
| Not available
|-
| MElt
| MEMM with external lexical information*
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 97.80%
| 91.77%
|}

(*) External lexical information from the Lefff lexicon (Sagot 2010, [https://gforge.inria.fr/frs/?group_id=482 Alexina project])

== References ==

* Brants, Thorsten. 2000. [http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf TnT -- A Statistical Part-of-Speech Tagger]. "6th Applied Natural Language Processing Conference".

* Chrupała, Grzegorz, Dinu, Georgiana and van Genabith, Josef. 2008. [http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf Learning Morphology with Morfette]. "LREC 2008".

* Collins, Michael. 2002. [http://people.csail.mit.edu/mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms]. ''EMNLP 2002''.

* Constant, Matthieu, Tellier, Isabelle, Duchier, Denys, Dupont, Yoann, Sigogne, Anthony, and Billot, Sylvie. 2011. [http://www.lirmm.fr/~lopez/TALN2011/Longs-TALN+RECITAL/Tellier_taln11_submission_54.pdf Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français]. "TALN'11"

* Denis, Pascal and Sagot, Benoît. 2009. [http://alpage.inria.fr/~sagot/pub/paclic09tagging.pdf Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort]. "PACLIC 2009"

* Giménez, J., and Márquez, L. 2004. [http://www.lsi.upc.es/~nlp/SVMTool/lrec2004-gm.pdf SVMTool: A general POS tagger generator based on Support Vector Machines]. ''Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)''. Lisbon, Portugal.

* Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171--189. Springer.

* Seddah, Djamé, Chrupała, Grzegorz, Çetinoglu, Özlem and Candito, Marie. 2010. [http://aclweb.org/anthology-new/W/W10/W10-1410.pdf Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French] "SPMRL 2010 (NAACL 2010 workshop)"

* Shen, L., Satta, G., and Joshi, A. 2007. [http://acl.ldc.upenn.edu/P/P07/P07-1096.pdf Guided learning for bidirectional sequence classification]. ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007)'', pages 760-767.

* Søgaard, Anders. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon.

* Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771.

* Toutanova, K., Klein, D., Manning, C.D., and Singer, Y. 2003. [http://nlp.stanford.edu/kristina/papers/tagging.pdf Feature-rich part-of-speech tagging with a cyclic dependency network]. ''Proceedings of HLT-NAACL 2003'', pages 252-259.

* Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/pci05.pdf Developing a Robust Part-of-Speech Tagger for Biomedical Text]". ''Advances in Informatics: 10th Panhellenic Conference on Informatics'', '''LNCS 3746''', pp. 382-392.

* Tsuruoka, Yoshimasa, Yusuke Miyao, and Jun’ichi Kazama. 2011. "[http://aclweb.org/anthology-new/W/W11/W11-0328.pdf Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models?]" ''Proceedings of the Fifteenth Conference on Computational Natural Language Learning'', pp 238–246, 2011.

* Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data]", ''Proceedings of HLT/EMNLP 2005'', pp. 467-474.

== See also ==
* [[POS Induction (State of the art)]]
* [[Part-of-speech tagging]]
* [[State of the art]]

[[Category:State of the art]]

POS Tagging (State of the art)

2012-12-29T17:08:49Z

Kiwibird: /* WSJ */

==Test collections==
* '''Performance measure:''' per-token accuracy. (By convention, this is measured on all tokens, including punctuation tokens and other unambiguous tokens.)
* '''English'''
** '''Penn Treebank''' ''Wall Street Journal'' (WSJ) release 3 (LDC99T42). The splits of data for this task were not standardized early on (unlike for parsing) and early work uses various data splits defined by counts of tokens or by sections. Most work from 2002 on adopts the following data splits, introduced by Collins (2002):
*** '''Training data:''' sections 0-18
*** '''Development test data:''' sections 19-21
*** '''Testing data:''' sections 22-24

* '''French'''
** '''French TreeBank''' (FTB; Abeillé et al., 2003) ''Le Monde'', December 2007 version, 28-tag tagset (CC tagset; Crabbé and Candito, 2008). Classical data split (10% test, 10% development, 80% training):
*** '''Training data:''' sentences 2471 to 12351
*** '''Development test data:''' sentences 1236 to 2470
*** '''Testing data:''' sentences 1 to 1235

== Tables of results ==

===WSJ===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| TnT*
| Hidden Markov model
| Brants (2000)
| [http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT]
| No
| 96.46%
| 85.86%
| Unknown
|-
| MElt
| MEMM with external lexical information
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 96.96%
| 91.29%
| CeCILL-C
|-
| GENiA Tagger**
| Maximum entropy cyclic dependency network
| Tsuruoka et al. (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/ GENiA]
| No
| 97.05%
| Not available
| Gratis for non-commercial usage
|-
| Averaged Perceptron
| Averaged perceptron discriminative sequence model
| Collins (2002)
| Not available
| No
| 97.11%
| Not available
| Unknown
|-
| Maxent easiest-first
| Maximum entropy bidirectional easiest-first inference
| Tsuruoka and Tsujii (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/ Easiest-first]
| No
| 97.15%
| Not available
| Unknown
|-
| SVMTool
| SVM-based tagger and tagger generator
| Giménez and Márquez (2004)
| [http://www.lsi.upc.es/~nlp/SVMTool/ SVMTool]
| No
| 97.16%
| 89.01%
| LGPL 2.1
|-
| LAPOS
| Perceptron-based training with lookahead
| Tsuruoka, Miyao, and Kazama (2011)
| [http://www.logos.t.u-tokyo.ac.jp/~tsuruoka/lapos/ LAPOS]
| No
| 97.22%
| Not available
| MIT
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| No
| 97.23%
| Not available
| Non-free ([http://ufal.mff.cuni.cz/compost/register.php academic-only])
|-
| Stanford Tagger 1.0
| Maximum entropy cyclic dependency network
| Toutanova et al. (2003)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.24%
| 89.04%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.29%
| 89.70%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| Yes
| 97.32%
| 90.79%
| GPL v2+
|-
| LTAG-spinal
| Bidirectional perceptron learning
| Shen et al. (2007)
| [http://www.cis.upenn.edu/~xtag/spinal/ LTAG-spinal]
| No
| 97.33%
| Not available
| Unknown
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| Yes
| 97.44%
| Not available
| Unknown
|-
| SCNN
| Semi-supervised condensed nearest neighbor
| Søgaard (2011)
| [http://cst.dk/anders/scnn/ SCNN]
| Yes
| 97.50%
| Not available
| Unknown
|}

(*) TnT: Accuracy is as reported by Giménez and Márquez (2004) for the given test collection. Brants (2000) reports 96.7% token accuracy and 85.5% unknown word accuracy on a 10-fold cross-validation of the Penn WSJ corpus.

(**) GENiA: Results are for models trained and tested on the given corpora (to be comparable to other results). The distributed GENiA tagger is trained on a mixed training corpus and gets 96.94% on WSJ, and 98.26% on GENiA biomedical English.

(***) Extra data: Whether system training exploited (usually large amounts of) extra unlabeled text, such as by semi-supervised learning, self-training, or using distributional similarity features, beyond the standard supervised training data.

===FTB===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
|-
| Morfette
| Perceptron with external lexical information*
| Chrupała et al. (2008), Seddah et al. (2010)
| [http://sites.google.com/site/morfetteweb/ Morfette]
| No
| 97.68%
| 90.52%
|-
| SEM
| CRF with external lexical information*
| Constant et al. (2011)
| [http://www.univ-orleans.fr/lifo/Members/Isabelle.Tellier/SEM.html SEM]
| No
| 97.7%
| Not available
|-
| MElt
| MEMM with external lexical information*
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 97.80%
| 91.77%
|}

(*) External lexical information from the Lefff lexicon (Sagot 2010, [https://gforge.inria.fr/frs/?group_id=482 Alexina project])

== References ==

* Brants, Thorsten. 2000. [http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf TnT -- A Statistical Part-of-Speech Tagger]. "6th Applied Natural Language Processing Conference".

* Chrupała, Grzegorz, Dinu, Georgiana and van Genabith, Josef. 2008. [http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf Learning Morphology with Morfette]. "LREC 2008".

* Collins, Michael. 2002. [http://people.csail.mit.edu/mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms]. ''EMNLP 2002''.

* Constant, Matthieu, Tellier, Isabelle, Duchier, Denys, Dupont, Yoann, Sigogne, Anthony, and Billot, Sylvie. 2011. [http://www.lirmm.fr/~lopez/TALN2011/Longs-TALN+RECITAL/Tellier_taln11_submission_54.pdf Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français]. "TALN'11"

* Denis, Pascal and Sagot, Benoît. 2009. [http://alpage.inria.fr/~sagot/pub/paclic09tagging.pdf Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort]. "PACLIC 2009"

* Giménez, J., and Márquez, L. 2004. [http://www.lsi.upc.es/~nlp/SVMTool/lrec2004-gm.pdf SVMTool: A general POS tagger generator based on Support Vector Machines]. ''Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)''. Lisbon, Portugal.

* Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171--189. Springer.

* Seddah, Djamé, Chrupała, Grzegorz, Çetinoglu, Özlem and Candito, Marie. 2010. [http://aclweb.org/anthology-new/W/W10/W10-1410.pdf Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French] "SPMRL 2010 (NAACL 2010 workshop)"

* Shen, L., Satta, G., and Joshi, A. 2007. [http://acl.ldc.upenn.edu/P/P07/P07-1096.pdf Guided learning for bidirectional sequence classification]. ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007)'', pages 760-767.

* Søgaard, Anders. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon.

* Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771.

* Toutanova, K., Klein, D., Manning, C.D., and Singer, Y. 2003. [http://nlp.stanford.edu/kristina/papers/tagging.pdf Feature-rich part-of-speech tagging with a cyclic dependency network]. ''Proceedings of HLT-NAACL 2003'', pages 252-259.

* Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/pci05.pdf Developing a Robust Part-of-Speech Tagger for Biomedical Text]". ''Advances in Informatics: 10th Panhellenic Conference on Informatics'', '''LNCS 3746''', pp. 382-392.

* Tsuruoka, Yoshimasa, Yusuke Miyao, and Jun’ichi Kazama. 2011. "[http://aclweb.org/anthology-new/W/W11/W11-0328.pdf Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models?]" ''Proceedings of the Fifteenth Conference on Computational Natural Language Learning'', pp 238–246, 2011.

* Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data]", ''Proceedings of HLT/EMNLP 2005'', pp. 467-474.

== See also ==
* [[POS Induction (State of the art)]]
* [[Part-of-speech tagging]]
* [[State of the art]]

[[Category:State of the art]]

POS Tagging (State of the art)

2012-12-29T17:06:16Z

Kiwibird: /* WSJ */

==Test collections==
* '''Performance measure:''' per-token accuracy. (By convention, this is measured on all tokens, including punctuation tokens and other unambiguous tokens.)
* '''English'''
** '''Penn Treebank''' ''Wall Street Journal'' (WSJ) release 3 (LDC99T42). The splits of data for this task were not standardized early on (unlike for parsing) and early work uses various data splits defined by counts of tokens or by sections. Most work from 2002 on adopts the following data splits, introduced by Collins (2002):
*** '''Training data:''' sections 0-18
*** '''Development test data:''' sections 19-21
*** '''Testing data:''' sections 22-24

* '''French'''
** '''French TreeBank''' (FTB; Abeillé et al., 2003) ''Le Monde'', December 2007 version, 28-tag tagset (CC tagset; Crabbé and Candito, 2008). Classical data split (10% test, 10% development, 80% training):
*** '''Training data:''' sentences 2471 to 12351
*** '''Development test data:''' sentences 1236 to 2470
*** '''Testing data:''' sentences 1 to 1235

== Tables of results ==

===WSJ===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| TnT*
| Hidden Markov model
| Brants (2000)
| [http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT]
| No
| 96.46%
| 85.86%
| Unknown
|-
| MElt
| MEMM with external lexical information
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 96.96%
| 91.29%
| Unknown
|-
| GENiA Tagger**
| Maximum entropy cyclic dependency network
| Tsuruoka et al. (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/ GENiA]
| No
| 97.05%
| Not available
| Gratis for non-commercial usage
|-
| Averaged Perceptron
| Averaged perceptron discriminative sequence model
| Collins (2002)
| Not available
| No
| 97.11%
| Not available
| Unknown
|-
| Maxent easiest-first
| Maximum entropy bidirectional easiest-first inference
| Tsuruoka and Tsujii (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/ Easiest-first]
| No
| 97.15%
| Not available
| Unknown
|-
| SVMTool
| SVM-based tagger and tagger generator
| Giménez and Márquez (2004)
| [http://www.lsi.upc.es/~nlp/SVMTool/ SVMTool]
| No
| 97.16%
| 89.01%
| LGPL 2.1
|-
| LAPOS
| Perceptron-based training with lookahead
| Tsuruoka, Miyao, and Kazama (2011)
| [http://www.logos.t.u-tokyo.ac.jp/~tsuruoka/lapos/ LAPOS]
| No
| 97.22%
| Not available
| MIT
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| No
| 97.23%
| Not available
| Non-free ([http://ufal.mff.cuni.cz/compost/register.php academic-only])
|-
| Stanford Tagger 1.0
| Maximum entropy cyclic dependency network
| Toutanova et al. (2003)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.24%
| 89.04%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.29%
| 89.70%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| Yes
| 97.32%
| 90.79%
| GPL v2+
|-
| LTAG-spinal
| Bidirectional perceptron learning
| Shen et al. (2007)
| [http://www.cis.upenn.edu/~xtag/spinal/ LTAG-spinal]
| No
| 97.33%
| Not available
| Unknown
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| Yes
| 97.44%
| Not available
| Unknown
|-
| SCNN
| Semi-supervised condensed nearest neighbor
| Søgaard (2011)
| [http://cst.dk/anders/scnn/ SCNN]
| Yes
| 97.50%
| Not available
| Unknown
|}

(*) TnT: Accuracy is as reported by Giménez and Márquez (2004) for the given test collection. Brants (2000) reports 96.7% token accuracy and 85.5% unknown word accuracy on a 10-fold cross-validation of the Penn WSJ corpus.

(**) GENiA: Results are for models trained and tested on the given corpora (to be comparable to other results). The distributed GENiA tagger is trained on a mixed training corpus and gets 96.94% on WSJ, and 98.26% on GENiA biomedical English.

(***) Extra data: Whether system training exploited (usually large amounts of) extra unlabeled text, such as by semi-supervised learning, self-training, or using distributional similarity features, beyond the standard supervised training data.

===FTB===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
|-
| Morfette
| Perceptron with external lexical information*
| Chrupała et al. (2008), Seddah et al. (2010)
| [http://sites.google.com/site/morfetteweb/ Morfette]
| No
| 97.68%
| 90.52%
|-
| SEM
| CRF with external lexical information*
| Constant et al. (2011)
| [http://www.univ-orleans.fr/lifo/Members/Isabelle.Tellier/SEM.html SEM]
| No
| 97.7%
| Not available
|-
| MElt
| MEMM with external lexical information*
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 97.80%
| 91.77%
|}

(*) External lexical information from the Lefff lexicon (Sagot 2010, [https://gforge.inria.fr/frs/?group_id=482 Alexina project])

== References ==

* Brants, Thorsten. 2000. [http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf TnT -- A Statistical Part-of-Speech Tagger]. "6th Applied Natural Language Processing Conference".

* Chrupała, Grzegorz, Dinu, Georgiana and van Genabith, Josef. 2008. [http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf Learning Morphology with Morfette]. "LREC 2008".

* Collins, Michael. 2002. [http://people.csail.mit.edu/mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms]. ''EMNLP 2002''.

* Constant, Matthieu, Tellier, Isabelle, Duchier, Denys, Dupont, Yoann, Sigogne, Anthony, and Billot, Sylvie. 2011. [http://www.lirmm.fr/~lopez/TALN2011/Longs-TALN+RECITAL/Tellier_taln11_submission_54.pdf Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français]. "TALN'11"

* Denis, Pascal and Sagot, Benoît. 2009. [http://alpage.inria.fr/~sagot/pub/paclic09tagging.pdf Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort]. "PACLIC 2009"

* Giménez, J., and Márquez, L. 2004. [http://www.lsi.upc.es/~nlp/SVMTool/lrec2004-gm.pdf SVMTool: A general POS tagger generator based on Support Vector Machines]. ''Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)''. Lisbon, Portugal.

* Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171--189. Springer.

* Seddah, Djamé, Chrupała, Grzegorz, Çetinoglu, Özlem and Candito, Marie. 2010. [http://aclweb.org/anthology-new/W/W10/W10-1410.pdf Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French] "SPMRL 2010 (NAACL 2010 workshop)"

* Shen, L., Satta, G., and Joshi, A. 2007. [http://acl.ldc.upenn.edu/P/P07/P07-1096.pdf Guided learning for bidirectional sequence classification]. ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007)'', pages 760-767.

* Søgaard, Anders. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon.

* Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771.

* Toutanova, K., Klein, D., Manning, C.D., and Singer, Y. 2003. [http://nlp.stanford.edu/kristina/papers/tagging.pdf Feature-rich part-of-speech tagging with a cyclic dependency network]. ''Proceedings of HLT-NAACL 2003'', pages 252-259.

* Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/pci05.pdf Developing a Robust Part-of-Speech Tagger for Biomedical Text]". ''Advances in Informatics: 10th Panhellenic Conference on Informatics'', '''LNCS 3746''', pp. 382-392.

* Tsuruoka, Yoshimasa, Yusuke Miyao, and Jun’ichi Kazama. 2011. "[http://aclweb.org/anthology-new/W/W11/W11-0328.pdf Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models?]" ''Proceedings of the Fifteenth Conference on Computational Natural Language Learning'', pp 238–246, 2011.

* Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data]", ''Proceedings of HLT/EMNLP 2005'', pp. 467-474.

== See also ==
* [[POS Induction (State of the art)]]
* [[Part-of-speech tagging]]
* [[State of the art]]

[[Category:State of the art]]

POS Tagging (State of the art)

2012-12-29T17:01:39Z

Kiwibird: The SVMTool library is licensed under LGPL 2.1

==Test collections==
* '''Performance measure:''' per-token accuracy. (By convention, this is measured on all tokens, including punctuation tokens and other unambiguous tokens.)
* '''English'''
** '''Penn Treebank''' ''Wall Street Journal'' (WSJ) release 3 (LDC99T42). The splits of data for this task were not standardized early on (unlike for parsing) and early work uses various data splits defined by counts of tokens or by sections. Most work from 2002 on adopts the following data splits, introduced by Collins (2002):
*** '''Training data:''' sections 0-18
*** '''Development test data:''' sections 19-21
*** '''Testing data:''' sections 22-24

* '''French'''
** '''French TreeBank''' (FTB; Abeillé et al., 2003) ''Le Monde'', December 2007 version, 28-tag tagset (CC tagset; Crabbé and Candito, 2008). Classical data split (10% test, 10% development, 80% training):
*** '''Training data:''' sentences 2471 to 12351
*** '''Development test data:''' sentences 1236 to 2470
*** '''Testing data:''' sentences 1 to 1235

== Tables of results ==

===WSJ===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| TnT*
| Hidden Markov model
| Brants (2000)
| [http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT]
| No
| 96.46%
| 85.86%
| Unknown
|-
| MElt
| MEMM with external lexical information
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 96.96%
| 91.29%
| Unknown
|-
| GENiA Tagger**
| Maximum entropy cyclic dependency network
| Tsuruoka et al. (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/ GENiA]
| No
| 97.05%
| Not available
| Gratis for non-commercial usage
|-
| Averaged Perceptron
| Averaged perceptron discriminative sequence model
| Collins (2002)
| Not available
| No
| 97.11%
| Not available
| Unknown
|-
| Maxent easiest-first
| Maximum entropy bidirectional easiest-first inference
| Tsuruoka and Tsujii (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/ Easiest-first]
| No
| 97.15%
| Not available
| Unknown
|-
| SVMTool
| SVM-based tagger and tagger generator
| Giménez and Márquez (2004)
| [http://www.lsi.upc.es/~nlp/SVMTool/ SVMTool]
| No
| 97.16%
| 89.01%
| LGPL 2.1
|-
| LAPOS
| Perceptron-based training with lookahead
| Tsuruoka, Miyao, and Kazama (2011)
| [http://www.logos.t.u-tokyo.ac.jp/~tsuruoka/lapos/ LAPOS]
| No
| 97.22%
| Not available
| MIT
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost COMPOST]
| No
| 97.23%
| Not available
| Unknown
|-
| Stanford Tagger 1.0
| Maximum entropy cyclic dependency network
| Toutanova et al. (2003)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.24%
| 89.04%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.29%
| 89.70%
| GPL v2+
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| Yes
| 97.32%
| 90.79%
| GPL v2+
|-
| LTAG-spinal
| Bidirectional perceptron learning
| Shen et al. (2007)
| [http://www.cis.upenn.edu/~xtag/spinal/ LTAG-spinal]
| No
| 97.33%
| Not available
| Unknown
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost]
| Yes
| 97.44%
| Not available
| Unknown
|-
| SCNN
| Semi-supervised condensed nearest neighbor
| Søgaard (2011)
| [http://cst.dk/anders/scnn/ SCNN]
| Yes
| 97.50%
| Not available
| Unknown
|}

(*) TnT: Accuracy is as reported by Giménez and Márquez (2004) for the given test collection. Brants (2000) reports 96.7% token accuracy and 85.5% unknown word accuracy on a 10-fold cross-validation of the Penn WSJ corpus.

(**) GENiA: Results are for models trained and tested on the given corpora (to be comparable to other results). The distributed GENiA tagger is trained on a mixed training corpus and gets 96.94% on WSJ, and 98.26% on GENiA biomedical English.

(***) Extra data: Whether system training exploited (usually large amounts of) extra unlabeled text, such as by semi-supervised learning, self-training, or using distributional similarity features, beyond the standard supervised training data.

===FTB===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
|-
| Morfette
| Perceptron with external lexical information*
| Chrupała et al. (2008), Seddah et al. (2010)
| [http://sites.google.com/site/morfetteweb/ Morfette]
| No
| 97.68%
| 90.52%
|-
| SEM
| CRF with external lexical information*
| Constant et al. (2011)
| [http://www.univ-orleans.fr/lifo/Members/Isabelle.Tellier/SEM.html SEM]
| No
| 97.7%
| Not available
|-
| MElt
| MEMM with external lexical information*
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 97.80%
| 91.77%
|}

(*) External lexical information from the Lefff lexicon (Sagot 2010, [https://gforge.inria.fr/frs/?group_id=482 Alexina project])

== References ==

* Brants, Thorsten. 2000. [http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf TnT -- A Statistical Part-of-Speech Tagger]. "6th Applied Natural Language Processing Conference".

* Chrupała, Grzegorz, Dinu, Georgiana and van Genabith, Josef. 2008. [http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf Learning Morphology with Morfette]. "LREC 2008".

* Collins, Michael. 2002. [http://people.csail.mit.edu/mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms]. ''EMNLP 2002''.

* Constant, Matthieu, Tellier, Isabelle, Duchier, Denys, Dupont, Yoann, Sigogne, Anthony, and Billot, Sylvie. 2011. [http://www.lirmm.fr/~lopez/TALN2011/Longs-TALN+RECITAL/Tellier_taln11_submission_54.pdf Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français]. "TALN'11".

* Denis, Pascal and Sagot, Benoît. 2009. [http://alpage.inria.fr/~sagot/pub/paclic09tagging.pdf Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort]. "PACLIC 2009"

* Giménez, J., and Márquez, L. 2004. [http://www.lsi.upc.es/~nlp/SVMTool/lrec2004-gm.pdf SVMTool: A general POS tagger generator based on Support Vector Machines]. ''Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)''. Lisbon, Portugal.

* Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171--189. Springer.

* Seddah, Djamé, Chrupała, Grzegorz, Çetinoglu, Özlem and Candito, Marie. 2010. [http://aclweb.org/anthology-new/W/W10/W10-1410.pdf Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French] "SPMRL 2010 (NAACL 2010 workshop)"

* Shen, L., Satta, G., and Joshi, A. 2007. [http://acl.ldc.upenn.edu/P/P07/P07-1096.pdf Guided learning for bidirectional sequence classification]. ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007)'', pages 760-767.

* Søgaard, Anders. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon.

* Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. Proceedings of the 12th Conference of the EACL, pages 763-771.

* Toutanova, K., Klein, D., Manning, C.D., Singer, Y. 2003. [http://nlp.stanford.edu/kristina/papers/tagging.pdf Feature-rich part-of-speech tagging with a cyclic dependency network]. ''Proceedings of HLT-NAACL 2003'', pages 252-259.

* Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/pci05.pdf Developing a Robust Part-of-Speech Tagger for Biomedical Text]". ''Advances in Informatics - 10th Panhellenic Conference on Informatics'', '''LNCS 3746''', pp. 382-392.

* Tsuruoka, Yoshimasa, Yusuke Miyao, and Jun’ichi Kazama. 2011. "[http://aclweb.org/anthology-new/W/W11/W11-0328.pdf Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models?]" ''Proceedings of the Fifteenth Conference on Computational Natural Language Learning'', pp 238–246, 2011.

* Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data]", ''Proceedings of HLT/EMNLP 2005'', pp. 467-474.

== See also ==
* [[POS Induction (State of the art)]]
* [[Part-of-speech tagging]]
* [[State of the art]]

[[Category:State of the art]]

Language Identification Tools

2012-12-19T14:41:02Z

Kiwibird: wops

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
** Interfaces to the C library libtextcat:
*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java reimplementation
** http://thomas.mangin.com//content/texcat-in-python.html – a python implementation by Thomas Mangin
** http://www.mnogosearch.org/guesser/ – another C reimplementation

* Languid/GuessLanguage, trigram based
** http://languid.cantbedone.org/ (dead link) original Perl version by Maciej Ceglowski
** http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup C++ version by Jacob R Rideout for KDE
** https://bitbucket.org/spirit/guess_language Python3 version by Phi-Long Do, supports Python2 via lib3to2

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google

* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification Tools

2012-12-17T11:08:52Z

Kiwibird: /* Free Software */

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
** Interfaces to the C library libtextcat:
*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java reimplementation
** http://thomas.mangin.com//content/texcat-in-python.html – a python implementation by Thomas Mangin
** http://www.mnogosearch.org/guesser/ – another C reimplementation

* Languid/GuessLanguage, trigram based
** http://languid.cantbedone.org/ (dead link) original Perl version by Maciej Ceglowski
** http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup C++ version by Jacob R Rideout for KDE
** https://bitbucket.org/spirit/guess_language Python3 version by Phi-Long Do, supports Python2 via lib3to2

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google

* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification Tools

2012-12-17T11:08:43Z

Kiwibird: /* Free Software */

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
** Interfaces to the C library libtextcat:
*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java reimplementation
** http://thomas.mangin.com//content/texcat-in-python.html – a python implementation by Thomas Mangin
** http://www.mnogosearch.org/guesser/ – another C reimplementation

* Languid/GuessLanguage, trigram based
** http://languid.cantbedone.org/ (dead link) original Perl version by Maciej Ceglowski
** http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup C++ version by Jacob R Rideout for KDE
** https://bitbucket.org/spirit/guess_language Python3 version by Phi-Long Do, supports Python2 via lib3to2

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google

* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification

2012-12-17T11:01:58Z

Kiwibird: Redirected page to Language Detection

#REDIRECT [[Language Detection]]

Resources for Chinese

2012-12-06T16:05:18Z

Kiwibird: /* Unknown license */

==Tools==
===Free software===
* [https://github.com/yzhang/rseg rseg] word segmentation; written in ruby (no compilation, no hard dependencies apart from ruby), comes with a model (MIT license)
* [https://code.google.com/p/ctbparser/ ctbparser] word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
* [http://www.cl.cam.ac.uk/~yz360/zpar.html ZPar] word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
* [http://code.google.com/p/duduplus/ DuDuPlus: a graph-based dependency parser for English and Chinese] ("Other Open Source" license?)
** where is the source code?

==Data==
===Free software===
* [http://corpora.heliohost.org/ HC Corpora] 1606811 lines of [http://en.wikipedia.org/wiki/Fair_use Fair Use] excerpts from news, blogs, twitter

===Unknown license===
* [http://www.chinesecomputing.com Chinese Computing]
* [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
* [http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html Frequency list of characters in the Internet corpus]
* [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
* [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]

[[Category:Resources by language|Chinese]]

Resources for Chinese

2012-12-06T16:02:55Z

Kiwibird: /* Free software */

==Tools==
===Free software===
* [https://github.com/yzhang/rseg rseg] word segmentation; written in ruby (no compilation, no hard dependencies apart from ruby), comes with a model (MIT license)
* [https://code.google.com/p/ctbparser/ ctbparser] word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
* [http://www.cl.cam.ac.uk/~yz360/zpar.html ZPar] word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
* [http://code.google.com/p/duduplus/ DuDuPlus: a graph-based dependency parser for English and Chinese] ("Other Open Source" license?)
** where is the source code?

==Data==
===Unknown license===
* [http://www.chinesecomputing.com Chinese Computing]
* [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
* [http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html Frequency list of characters in the Internet corpus]
* [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
* [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]

[[Category:Resources by language|Chinese]]

Resources for Chinese

2012-12-06T16:02:31Z

Kiwibird: /* Free software */

==Tools==
===Free software===
* [https://github.com/yzhang/rseg rseg] word segmentation, in ruby (no compilation, no hard dependencies apart from ruby), comes with a model (MIT license)
* [https://code.google.com/p/ctbparser/ ctbparser] word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
* [http://www.cl.cam.ac.uk/~yz360/zpar.html ZPar] word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
* [http://code.google.com/p/duduplus/ DuDuPlus: a graph-based dependency parser for English and Chinese] ("Other Open Source" license?)
** where is the source code?

==Data==
===Unknown license===
* [http://www.chinesecomputing.com Chinese Computing]
* [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
* [http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html Frequency list of characters in the Internet corpus]
* [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
* [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]

[[Category:Resources by language|Chinese]]

Resources for Chinese

2012-12-06T15:56:29Z

Kiwibird:

==Tools==
===Free software===
* [https://github.com/yzhang/rseg rseg] word segmentation, in ruby (no compilation, no hard dependencies apart from ruby), comes with a model (MIT license)
* [http://code.google.com/p/duduplus/ DuDuPlus: a graph-based dependency parser for English and Chinese] ("Other Open Source" – where is the source code though?)

==Data==
===Unknown license===
* [http://www.chinesecomputing.com Chinese Computing]
* [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
* [http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html Frequency list of characters in the Internet corpus]
* [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
* [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]

[[Category:Resources by language|Chinese]]

Language Identification Tools

2012-12-06T09:25:36Z

Kiwibird: /* Free Software */

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
** Interfaces to the C library libtextcat:
*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java reimplementation
** http://thomas.mangin.com//content/texcat-in-python.html – a python implementation by Thomas Mangin
** http://www.mnogosearch.org/guesser/ – another C reimplementation

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google

* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification Tools

2012-12-06T09:13:44Z

Kiwibird: /* Free Software */

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
** Interfaces to the C library libtextcat:
*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java reimplementation
** http://thomas.mangin.com//content/texcat-in-python.html – a python implementation by Thomas Mangin
** http://www.mnogosearch.org/guesser/ – another C reimplementation

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google

* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification Tools

2012-12-06T09:08:23Z

Kiwibird: /* Free Software */

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
** Interfaces to the C library libtextcat:
*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java reimplementation

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google

* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification Tools

2012-12-06T09:01:34Z

Kiwibird: /* Free Software */

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.

==Free Software==
* LibTextCat http://software.wise-guys.nl/libtextcat/ (BSD license)
** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat
** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
** http://olivo.net/software/lc4j/ – a java implementation
** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
** https://code.google.com/p/language-detection/ source code, data for 53 languages
** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection
* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google
* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

==Proprietary==
* Google Language Identification API
* Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

==See also==
* [[Language Identification (State of the art)]]
* [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord

Language Identification Tools

2012-12-06T08:59:19Z

Kiwibird: /* See also */

Language Identification (State of the art)

2012-12-06T08:56:09Z

Kiwibird: /* See also */

== "Standard" measure: ==

== "Standard" datasets: ==

{{StateOfTheArtTable}}

| SystemName || How does it work? || Author and Article [http://www.bla.com] || Software? || 98% according to... || Any extra comments?
|-
| textcat || n-gram matching || Cavnar, W. B. and J. M. Trenkle (1994) "[http://www.nonlineardynamics.com/trenkle/papers/sdr94ps.gz N-Gram-Based Text Categorization]" || Yes || - || -
|-

|}
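
The "n-gram matching" entry in the table can be illustrated with a minimal sketch of rank-order profile matching in the spirit of Cavnar and Trenkle (1994). This is not the code of textcat or libtextcat: the function names, the underscore padding, and the values of <code>max_n</code> and <code>top_k</code> are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams (1..max_n), most frequent first.

    max_n and top_k are illustrative defaults, not values taken from any tool.
    """
    counts = Counter()
    # Pad word boundaries with "_" so word-initial and word-final n-grams count.
    padded = "_" + "_".join(text.lower().split()) + "_"
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the profile of text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Here <code>lang_profiles</code> is a dictionary from a language code to a profile built with <code>ngram_profile</code> from training text in that language. Realistic profiles need far more training text than a single sentence.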

==See also==
* [[Language Identification Tools]]

[[Category:State of the art]]

Language Identification (State of the art)

2012-12-06T08:55:54Z

Kiwibird:

== "Standard" measure: ==

== "Standard" datasets: ==

{{StateOfTheArtTable}}

| SystemName || How does it work? || Author and Article [http://www.bla.com] || Software? || 98% according to... || Any extra comments?
|-
| textcat || n-gram matching || Cavnar, W. B. and J. M. Trenkle (1994) "[http://www.nonlineardynamics.com/trenkle/papers/sdr94ps.gz N-Gram-Based Text Categorization]" || Yes || - || -
|-

|}

==See also==
[[Language Identification Tools]]

[[Category:State of the art]]

Language Detection

2012-12-06T08:55:30Z

Kiwibird: Redirected page to Language Identification Tools

#REDIRECT [[Language Identification Tools]]

Language Identification Tools

2012-12-06T08:54:28Z

Kiwibird: /* Free Software */

Language Identification Tools

2012-12-06T08:52:18Z

Kiwibird: /* Free Software */

Language Identification Tools

2012-12-06T08:49:32Z

Kiwibird:

Language Identification Tools

2012-12-06T08:48:45Z

Kiwibird: Created page with "A listing of language identification tools. Language identification can mean both identifiying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch)..."

Resources for Polish

2012-08-06T10:41:19Z

Kiwibird: /* Free/Open Source Software */

==Corpora==
* [http://korpus.pl/en/ IPI PAN Corpus] - The IPI PAN Corpus is a large (currently over 250 million segments), morphosyntactically annotated, publicly available corpus of Polish, developed by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS)
* [http://korpus.pwn.pl/index_en.php PWN Corpus] - PWN has prepared and made available an online version of the Corpus of Polish consisting of 40 million words. The samples were taken from 386 books, 977 editions selected from 185 different press publications, 84 transcribed spoken texts, 207 web sites and several hundred advertising leaflets and other ephemera. The full version of the corpus is available on payment for access, while a demonstration version of over 7.5 million words is available free of charge.

==Taggers, parsers, morphology analysers==

==Free/Open Source Software==
* [http://morfologik.blogspot.com/ Morfologik] -- morphological dictionary by Marcin Miłkowski (of LanguageTool), licensed under CC-SA / GNU LGPL
** [http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Morfologik_converted Morfologik converted to the IKIPI tagset] (the tagset of the IPI PAN Corpus)
* [http://nlp.pwr.wroc.pl/en/tools-and-resources/narzedzia-przetwarzania-morfosyntaktycznego Morphosyntactic Toolchain] by WrocUT Language Technology Group G4.19, licensed under GNU LGPL (some optional addons are GNU GPL). Command-line utilities providing tokenisation, morphological analysis, morphosyntactic tagging, shallow parsing (chunking), WCCL feature vectors for machine learning.

==Unknown license==
* [http://nlp.ipipan.waw.pl/~wolinski/morfeusz/ "Morfeusz"] - morphological analyser of Polish (Wolinski, 2005)
** [http://www.springerlink.com/content/l101v8823391j568/ main reference] Morfeusz — a Practical Tool for the Morphological Analysis of Polish
* "AMOR" - morphological analyser of Polish (Joanna Rabiega, 2000)
** [http://members.chello.pl/jrw/doc/jr_ma.pdf/ main reference] Podstawy lingwistyczne automatycznego analizatora morfologicznego AMOR
* [http://duch.mimuw.edu.pl/~kszafran/index.php?option=com_docman&task=cat_view&gid=33&Itemid=43 "SAM"] - morphological analyser of Polish (Krzysztof Szafran, 1994)
* [http://sourceforge.net/project/showfiles.php?group_id=166344 Morfologik] - Polish morphological analyzer based on current ispell dictionaries, and Java libraries interfacing it. First completely open-source and comprehensive morphological tools for Polish. Will be used for grammar correction tools (to be included in the future)
* [http://nlp.ipipan.waw.pl/Spejd/ Spejd - Shallow Parsing and Disambiguation Engine]
* [http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml lemmatizer] - Dawid Weiss

==Lexical resources==

==Bibliography==

==External links==
* [http://bach.ipipan.waw.pl/mailman/listinfo/ling Polish linguistics mailing list] - mainly in Polish

[[Category:Resources by language|Polish]]

Talk:Resources for German

2012-07-15T08:40:06Z

Kiwibird: Created page with "More resources discussed at http://thread.gmane.org/gmane.science.linguistics.corpora/16011/focus=16025"

More resources discussed at http://thread.gmane.org/gmane.science.linguistics.corpora/16011/focus=16025

POS Tagging (State of the art)

2012-06-15T14:03:49Z

Kiwibird: /* WSJ */

==Test collections==
* '''Performance measure:''' per-token accuracy. (The convention is to measure this on all tokens, including punctuation tokens and other unambiguous tokens.)
* '''English'''
** '''Penn Treebank''' ''Wall Street Journal'' (WSJ) release 3 (LDC99T42). The splits of data for this task were not standardized early on (unlike for parsing) and early work uses various data splits defined by counts of tokens or by sections. Most work from 2002 on adopts the following data splits, introduced by Collins (2002):
*** '''Training data:''' sections 0-18
*** '''Development test data:''' sections 19-21
*** '''Testing data:''' sections 22-24

* '''French'''
** '''French TreeBank''' (FTB, Abeillé et al., 2003) ''Le Monde'', December 2007 version, 28-tag tagset (CC tagset, Crabbé and Candito, 2008). Classical data split (10-10-80):
*** '''Training data:''' sentences 2471 to 12351
*** '''Development test data:''' sentences 1236 to 2470
*** '''Testing data:''' sentences 1 to 1235
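
The data splits listed above can be written down as a small lookup. The following is a minimal sketch only: the function names <code>wsj_split</code> and <code>ftb_split</code> are hypothetical, and the FTB sentence indices are assumed to be 1-based, as the ranges above suggest.

```python
def wsj_split(section: int) -> str:
    """Map a WSJ section number (0-24) to its split.

    Follows the section ranges listed above (Collins, 2002):
    0-18 training, 19-21 development test, 22-24 testing.
    """
    if not 0 <= section <= 24:
        raise ValueError(f"WSJ section out of range: {section}")
    if section <= 18:
        return "train"
    if section <= 21:
        return "dev"
    return "test"


def ftb_split(sentence_index: int) -> str:
    """Map an FTB sentence index to its split.

    Assumes 1-based indices (an assumption of this sketch):
    1-1235 testing, 1236-2470 development test, 2471-12351 training.
    """
    if not 1 <= sentence_index <= 12351:
        raise ValueError(f"FTB sentence index out of range: {sentence_index}")
    if sentence_index <= 1235:
        return "test"
    if sentence_index <= 2470:
        return "dev"
    return "train"
```

For example, <code>wsj_split(20)</code> returns <code>"dev"</code> and <code>ftb_split(2471)</code> returns <code>"train"</code>.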

== Tables of results ==

===WSJ===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
! License
|-
| TnT*
| Hidden Markov model
| Brants (2000)
| [http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT]
| No
| 96.46%
| 85.86%
| Unknown
|-
| MElt
| MEMM with external lexical information
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 96.96%
| 91.29%
| Unknown
|-
| GENiA Tagger**
| Maximum entropy cyclic dependency network
| Tsuruoka et al. (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/ GENiA]
| No
| 97.05%
| Not available
| Gratis for non-commercial usage
|-
| Averaged Perceptron
| Averaged Perceptron discriminative sequence model
| Collins (2002)
| Not available
| No
| 97.11%
| Not available
| Unknown
|-
| Maxent easiest-first
| Maximum entropy bidirectional easiest-first inference
| Tsuruoka and Tsujii (2005)
| [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/ Easiest-first]
| No
| 97.15%
| Not available
| Unknown
|-
| SVMTool
| SVM-based tagger and tagger generator
| Giménez and Márquez (2004)
| [http://www.lsi.upc.es/~nlp/SVMTool/ SVMTool]
| No
| 97.16%
| 89.01%
| Unknown
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost]
| No
| 97.23%
| Not available
| Unknown
|-
| Stanford Tagger 1.0
| Maximum entropy cyclic dependency network
| Toutanova et al. (2003)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.24%
| 89.04%
| Unknown
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| No
| 97.29%
| 89.70%
| Unknown
|-
| Stanford Tagger 2.0
| Maximum entropy cyclic dependency network
| Manning (2011)
| [http://nlp.stanford.edu/software/tagger.shtml Stanford Tagger]
| Yes
| 97.32%
| 90.79%
| Unknown
|-
| LTAG-spinal
| Bidirectional perceptron learning
| Shen et al. (2007)
| [http://www.cis.upenn.edu/~xtag/spinal/ LTAG-spinal]
| No
| 97.33%
| Not available
| Unknown
|-
| Morče/COMPOST
| Averaged Perceptron
| Spoustová et al. (2009)
| [http://ufal.mff.cuni.cz/compost]
| Yes
| 97.44%
| Not available
| Unknown
|-
| SCNN
| Semi-supervised condensed nearest neighbor
| Søgaard (2011)
| [http://cst.dk/anders/scnn/ SCNN]
| Yes
| 97.50%
| Not available
| Unknown
|}

(*) TnT: Accuracy is as reported by Giménez and Márquez (2004) for the given test collection. Brants (2000) reports 96.7% token accuracy and 85.5% unknown word accuracy on a 10-fold cross-validation of the Penn WSJ corpus.

(**) GENiA: Results are for models trained and tested on the given corpora (to be comparable to other results). The distributed GENiA tagger is trained on a mixed training corpus and gets 96.94% on WSJ, and 98.26% on GENiA biomedical English.

(***) Extra data: Whether system training exploited (usually large amounts of) extra unlabeled text, such as by semi-supervised learning, self-training, or using distributional similarity features, beyond the standard supervised training data.
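
As an illustration of the two accuracy columns in the table above, here is a minimal sketch of how per-token accuracy might be computed. The function name <code>tagging_accuracy</code> is hypothetical, and treating "unknown words" as tokens absent from the training vocabulary is an assumption of this sketch; individual papers may define unknown words differently.

```python
from typing import Optional, Sequence, Set, Tuple


def tagging_accuracy(
    tokens: Sequence[str],
    gold_tags: Sequence[str],
    predicted_tags: Sequence[str],
    training_vocab: Set[str],
) -> Tuple[float, Optional[float]]:
    """Return (all-token accuracy, unknown-word accuracy).

    Every token counts, including punctuation. A token is treated as
    "unknown" if it does not occur in training_vocab (an assumption of
    this sketch). The second value is None when there are no unknown tokens.
    """
    if not (len(tokens) == len(gold_tags) == len(predicted_tags)):
        raise ValueError("tokens, gold_tags and predicted_tags must align")
    if not tokens:
        raise ValueError("cannot compute accuracy on an empty sequence")

    correct = 0
    unknown = 0
    unknown_correct = 0
    for token, gold, predicted in zip(tokens, gold_tags, predicted_tags):
        hit = gold == predicted
        correct += hit
        if token not in training_vocab:
            unknown += 1
            unknown_correct += hit

    all_accuracy = correct / len(tokens)
    unknown_accuracy = unknown_correct / unknown if unknown else None
    return all_accuracy, unknown_accuracy


# Toy example: one of four tokens is mistagged, and it is the only unknown word.
tokens = ["The", "dog", "barks", "."]
gold = ["DT", "NN", "VBZ", "."]
predicted = ["DT", "NN", "NNS", "."]
vocab = {"The", "dog", "."}
print(tagging_accuracy(tokens, gold, predicted, vocab))  # (0.75, 0.0)
```

Counting punctuation and other unambiguous tokens raises the all-token figure relative to a measure restricted to ambiguous words, which is one reason the unknown-word column is reported separately.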

===FTB===

{| border="1" cellpadding="5" cellspacing="1" width="100%"
|-
! System name
! Short description
! Main publication
! Software
! Extra Data?***
! All tokens
! Unknown words
|-
| Morfette
| Perceptron with external lexical information*
| Chrupała et al. (2008), Seddah et al. (2010)
| [http://sites.google.com/site/morfetteweb/ Morfette]
| No
| 97.68%
| 90.52%
|-
| SEM
| CRF with external lexical information*
| Constant et al. (2011)
| [http://www.univ-orleans.fr/lifo/Members/Isabelle.Tellier/SEM.html SEM]
| No
| 97.7%
| Not available
|-
| MElt
| MEMM with external lexical information*
| Denis and Sagot (2009)
| [https://gforge.inria.fr/projects/lingwb/ Alpage linguistic workbench]
| No
| 97.80%
| 91.77%
|}

(*) External lexical information from the Lefff lexicon (Sagot 2010, [https://gforge.inria.fr/frs/?group_id=482 Alexina project])

== References ==

* Brants, Thorsten. 2000. [http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf TnT -- A Statistical Part-of-Speech Tagger]. ''6th Applied Natural Language Processing Conference''.

* Chrupała, Grzegorz, Dinu, Georgiana and van Genabith, Josef. 2008. [http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf Learning Morphology with Morfette]. ''LREC 2008''.

* Collins, Michael. 2002. [http://people.csail.mit.edu/mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms]. ''EMNLP 2002''.

* Constant, Matthieu, Tellier, Isabelle, Duchier, Denys, Dupont, Yoann, Sigogne, Anthony, and Billot, Sylvie. 2011. [http://www.lirmm.fr/~lopez/TALN2011/Longs-TALN+RECITAL/Tellier_taln11_submission_54.pdf Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français]. ''TALN'11''.

* Denis, Pascal and Sagot, Benoît. 2009. [http://alpage.inria.fr/~sagot/pub/paclic09tagging.pdf Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort]. ''PACLIC 2009''.

* Giménez, J., and Márquez, L. 2004. [http://www.lsi.upc.es/~nlp/SVMTool/lrec2004-gm.pdf SVMTool: A general POS tagger generator based on Support Vector Machines]. ''Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)''. Lisbon, Portugal.

* Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171--189. Springer.

* Seddah, Djamé, Chrupała, Grzegorz, Çetinoglu, Özlem and Candito, Marie. 2010. [http://aclweb.org/anthology-new/W/W10/W10-1410.pdf Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French]. ''SPMRL 2010 (NAACL 2010 workshop)''.

* Shen, L., Satta, G., and Joshi, A. 2007. [http://acl.ldc.upenn.edu/P/P07/P07-1096.pdf Guided learning for bidirectional sequence classification]. ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007)'', pages 760-767.

* Søgaard, Anders. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon.

* Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. ''Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)'', pages 763-771.

* Toutanova, K., Klein, D., Manning, C.D., and Singer, Y. 2003. [http://nlp.stanford.edu/kristina/papers/tagging.pdf Feature-rich part-of-speech tagging with a cyclic dependency network]. ''Proceedings of HLT-NAACL 2003'', pages 252-259.

* Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/pci05.pdf Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics]" - ''10th Panhellenic Conference on Informatics'', '''LNCS 3746''', pp. 382-392, 2005

* Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. "[http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data]", ''Proceedings of HLT/EMNLP 2005'', pp. 467-474.

== See also ==
* [[POS Induction (State of the art)]]
* [[Part-of-speech tagging]]
* [[State of the art]]

[[Category:State of the art]]

Resources for Slovenian

2012-05-30T12:36:47Z

Kiwibird: /* Corpora */

==Corpora==
* [http://nl.ijs.si/elan/ IJS - ELAN] Slovene-English Parallel Corpus
** License: "freely available for downloading, but please acknowledge in any publications"

* [http://nl.ijs.si/ME/ Multext EAST] lexica, annotated "1984" corpus, parallel and comparable text and speech corpora.
** License: "research use only"
** Languages involved: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian

* [http://langtech.jrc.it/JRC-Acquis.html JRC Acquis] parallel texts.
** License: Public domain.
** Languages involved: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.

[[Category:Resources by language|Slovenian]]

Resources for Slovenian

2012-05-30T12:34:42Z

Kiwibird: /* Corpora */

==Corpora==
* [http://nl.ijs.si/elan/ Slovene-English IJS - ELAN Parallel Corpus] License: "freely available for downloading, but please acknowledge in any publications"

* [http://nl.ijs.si/ME/ Multext EAST] License: "research use only"

* [http://langtech.jrc.it/JRC-Acquis.html JRC Acquis] parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish. License: Public domain.

[[Category:Resources by language|Slovenian]]

Resources for Slovenian

2012-05-30T12:31:55Z

Kiwibird: /* Corpora */

==Corpora==
* [http://nl.ijs.si/elan/ Slovene-English IJS - ELAN Parallel Corpus] License: "freely available for downloading, but please acknowledge in any publications"
* [http://langtech.jrc.it/JRC-Acquis.html JRC Acquis] parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish. License: Public domain.

[[Category:Resources by language|Slovenian]]

Resources for Slovenian

2012-05-30T12:30:40Z

Kiwibird: /* Corpora */

==Corpora==
* [http://nl.ijs.si/elan/ Slovene-English Parallel Corpus] License: "freely available for downloading, but please acknowledge in any publications"
* [http://langtech.jrc.it/JRC-Acquis.html JRC Acquis] parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish. License: Public domain.

[[Category:Resources by language|Slovenian]]

Resources for Norwegian

2012-02-23T14:10:16Z

Kiwibird: /* Free software */

==Corpora==
===Free software===

===Proprietary===
* [http://corpora.informatik.uni-leipzig.de/ Norwegian plain text and Co-occurrences at LCC] ("the corpora may be used for scientific purposes only and not passed on to third parties")

==Timeline Analysis==
* [http://wortschatz.uni-leipzig.de/wdtno/ Ord I Dag]

==Machine translation systems==

===Free software===

* [http://www.apertium.org Apertium] Norwegian Nynorsk<->Norwegian Bokmål, GPL v2
** [http://wiki.apertium.org/wiki/Apertium-nn-nb wiki] with installation information etc.

===Proprietary===

==Lexical resources==
===Free software===
* [http://svn.emmtee.net/tags/topp/parc/pargram/norwegian/bokmal/bokmal-nkllex.lfg Bokmål LFG lexicon] with POS and count/mass, GPL
* [http://www.edd.uio.no/prosjekt/ordbanken/ Norsk ordbank], full form dictionaries for Nynorsk (106,789 lemmata) and Bokmål (142,899 lemmata), GPL
** [http://savannah.nongnu.org/projects/ordbanken/ alternative download with cli lookup interface]
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar SCARRIE, Bokmål full form dictionary], XML, about 75,000 lemmata, CC-BY unported

===Unknown license===
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar "Leksikalsk database for norsk, opphavleg produsert av NST"], lexical database with SAMPA transcriptions, meant for speech technology

==Parsing/disambiguation==
===Free software===
* [http://www.hf.uio.no/tekstlab/tagger.html Oslo-Bergen-taggeren], [[Constraint Grammar]] disambiguator, GPL
** [https://github.com/noklesta/The-Oslo-Bergen-Tagger source and packages on github]
** [http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger older alternative download site]
** [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-nn-nb/ the version used in Apertium]
** [https://github.com/ogrim/clj-obt Clojure bindings]

* [http://www.hf.ntnu.no/hf/isk/Ansatte/petter.haugereid/norsyg.html Norsyg], [[HPSG]] grammar for Norwegian bokmål, LGPL. Implemented in [[LKB]], works with the full ''Norsk ordbank'' lexicon.

[[Category:Resources by language|Norwegian]]

Resources for Norwegian

2012-02-23T14:08:58Z

Kiwibird: /* Free software */

==Corpora==
===Free software===

===Proprietary===
* [http://corpora.informatik.uni-leipzig.de/ Norwegian plain text and Co-occurrences at LCC] ("the corpora may be used for scientific purposes only and not passed on to third parties")

==Timeline Analysis==
* [http://wortschatz.uni-leipzig.de/wdtno/ Ord I Dag]

==Machine translation systems==

===Free software===

* [http://www.apertium.org Apertium] Norwegian Nynorsk<->Norwegian Bokmål, GPL v2
** [http://wiki.apertium.org/wiki/Apertium-nn-nb wiki] with installation information etc.

===Proprietary===

==Lexical resources==
===Free software===
* [http://svn.emmtee.net/tags/topp/parc/pargram/norwegian/bokmal/bokmal-nkllex.lfg Bokmål LFG lexicon] with POS and count/mass, GPL
* [http://www.edd.uio.no/prosjekt/ordbanken/ Norsk ordbank], full form dictionaries for Nynorsk (106,789 lemmata) and Bokmål (142,899 lemmata), GPL
** [http://savannah.nongnu.org/projects/ordbanken/ alternative download with cli lookup interface]
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar SCARRIE, Bokmål full form dictionary], XML, about 75,000 lemmata, CC-BY unported

===Unknown license===
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar "Leksikalsk database for norsk, opphavleg produsert av NST"], lexical database with SAMPA transcriptions, meant for speech technology

==Parsing/disambiguation==
===Free software===
* [http://www.hf.uio.no/tekstlab/tagger.html Oslo-Bergen-taggeren], [[Constraint Grammar]] disambiguator, GPL
** [http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger download], [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-nn-nb/ the version used in Apertium]

* [http://www.hf.ntnu.no/hf/isk/Ansatte/petter.haugereid/norsyg.html Norsyg], [[HPSG]] grammar for Norwegian bokmål, LGPL. Implemented in [[LKB]], works with the full ''Norsk ordbank'' lexicon.

[[Category:Resources by language|Norwegian]]

Resources for Norwegian

2012-02-23T14:06:58Z

Kiwibird: /* Free software */

==Corpora==
===Free software===

===Proprietary===
* [http://corpora.informatik.uni-leipzig.de/ Norwegian plain text and Co-occurrences at LCC] ("the corpora may be used for scientific purposes only and not passed on to third parties")

==Timeline Analysis==
* [http://wortschatz.uni-leipzig.de/wdtno/ Ord I Dag]

==Machine translation systems==

===Free software===

* [http://www.apertium.org Apertium] Norwegian Nynorsk<->Norwegian Bokmål, GPL v2
** [http://wiki.apertium.org/wiki/Apertium-nn-nb wiki] with installation information etc.

===Proprietary===

==Lexical resources==
===Free software===
* [http://svn.emmtee.net/tags/topp/parc/pargram/norwegian/bokmal/bokmal-nkllex.lfg Bokmål LFG lexicon] with POS and count/mass, GPL
* [http://www.edd.uio.no/prosjekt/ordbanken/ Norsk ordbank], full form dictionaries for Nynorsk (106,789 lemmata) and Bokmål (142,899 lemmata), GPL
** [http://savannah.nongnu.org/projects/ordbanken/ alternative download with cli lookup interface]
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar SCARRIE, Bokmål full form dictionary], XML, about 75,000 lemmata, CC-BY unported

==Parsing/disambiguation==
===Free software===
* [http://www.hf.uio.no/tekstlab/tagger.html Oslo-Bergen-taggeren], [[Constraint Grammar]] disambiguator, GPL
** [http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger download], [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-nn-nb/ the version used in Apertium]

* [http://www.hf.ntnu.no/hf/isk/Ansatte/petter.haugereid/norsyg.html Norsyg], [[HPSG]] grammar for Norwegian bokmål, LGPL. Implemented in [[LKB]], works with the full ''Norsk ordbank'' lexicon.

[[Category:Resources by language|Norwegian]]

Resources for Norwegian

2012-02-23T14:06:47Z

Kiwibird: /* Free software */

==Corpora==
===Free software===

===Proprietary===
* [http://corpora.informatik.uni-leipzig.de/ Norwegian plain text and Co-occurrences at LCC] ("the corpora may be used for scientific purposes only and not passed on to third parties")

==Timeline Analysis==
* [http://wortschatz.uni-leipzig.de/wdtno/ Ord I Dag]

==Machine translation systems==

===Free software===

* [http://www.apertium.org Apertium] Norwegian Nynorsk<->Norwegian Bokmål, GPL v2
** [http://wiki.apertium.org/wiki/Apertium-nn-nb wiki] with installation information etc.

===Proprietary===

==Lexical resources==
===Free software===
* [http://svn.emmtee.net/tags/topp/parc/pargram/norwegian/bokmal/bokmal-nkllex.lfg Bokmål LFG lexicon] with POS and count/mass, GPL
* [http://www.edd.uio.no/prosjekt/ordbanken/ Norsk ordbank], full form dictionaries for Nynorsk (106,789 lemmata) and Bokmål (142,899 lemmata), GPL
** [http://savannah.nongnu.org/projects/ordbanken/ alternative download with cli lookup interface]
* [http://www.nb.no/spraakbanken/tilgjengelege-ressursar/leksikalske-databasar SCARRIE, Bokmål full form dictionary], XML, about 75,000 lemmata, CC-BY unported

==Parsing/disambiguation==
===Free software===
* [http://www.hf.uio.no/tekstlab/tagger.html Oslo-Bergen-taggeren], [[Constraint Grammar]] disambiguator, GPL
** [http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger download], [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-nn-nb/ the version used in Apertium]

* [http://www.hf.ntnu.no/hf/isk/Ansatte/petter.haugereid/norsyg.html Norsyg], [[HPSG]] grammar for Norwegian bokmål, LGPL. Implemented in [[LKB]], works with the full ''Norsk ordbank'' lexicon.

[[Category:Resources by language|Norwegian]]

Resources for Japanese

2011-12-02T07:56:07Z

Kiwibird: /* Multilingual */

There is a very good list at Kyoto University: [http://www-lab25.kuee.kyoto-u.ac.jp/NLP_Portal/lr-cat-e.html Catalogue of Language Resources and Tools in Japan]

==Corpora==
===Proprietary===
* [http://corpora.informatik.uni-leipzig.de/ Japanese plain text and Co-occurrences at LCC] (downloadable and web-searchable, but only for non-commercial use)
* [http://www.ninjal.ac.jp/english/products/bccwj/ Balanced Corpus of Contemporary Written Japanese (BCCWJ)] (subset is web searchable at Kotonoha)

===Free/Open Licence===
====Multilingual====
* [http://www.edrdg.org/projects/tanaka/tanakacorpus.html Tanaka Corpus] by Jim Breen, under a CC-BY-SA 3.0 licence
** [http://tatoeba.org/eng/home Tatoeba] Updated version of the Tanaka Corpus; ≈150,000 sentence pairs (CC-BY)
* [http://alaginrc.nict.go.jp/WikiCorpus/index_E.html Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles] ≈500,000 pairs of manually-translated sentences (CC-BY 3.0)
* [http://id.ndl.go.jp/auth/ndlsh National Diet Library Subject Headers] Japanese subject headers, with paraphrases including English translations ([http://id.ndl.go.jp/auth/docs/about-ndlsh#03 non-commercial attribution])
* [http://mastarpj.nict.go.jp/~mutiyama/align/index.html English-Japanese Translation Alignment Data] aligned by [http://mastarpj.nict.go.jp/~mutiyama/ Masao Utiyama] (GFDL, CC-by-nc 1.0)
* [http://nlpwww.nict.go.jp/wn-ja/index.en.html WordNet Definitions and Glosses] ≈180,000 sentence/phrase pairs (WordNet license, similar to BSD)
* [http://www.phontron.com/kftt/#alignments The Kyoto Free Translation Task (KFTT)] by Graham Neubig, 1235 manually word-aligned Japanese-English sentences

====Monolingual====
* [http://www-lab25.kuee.kyoto-u.ac.jp/NLP_Portal/lr-cat-e.html#jp:knb_corpus Kyoto University and NTT Blog Corpus]

== Grammars ==
===Free/Open Licence===
* [http://wiki.delph-in.net/moin/JacyTop Jacy HPSG grammar] MIT Licence
===Unknown licence===
* [[Generation grammars|KPML generation grammar]] (downloadable)

==Dictionaries==
===Free/Open Licence===
* [http://www.csse.monash.edu.au/~jwb/edict.html EDICT] Japanese-English dictionary, by Jim Breen, (CC-BY-SA 3.0 licence)
* [http://www.csse.monash.edu.au/~jwb/enamdict_doc.html ENAMDICT/JMnedict] proper name dictionary, by Jim Breen, (CC-BY-SA 3.0 licence)
* [http://nlpwww.nict.go.jp/wn-ja/index.en.html Japanese version of WordNet] by NICT, (WordNet license, like BSD)

===Unknown licence===
* [http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html List of Japanese transitive/intransitive verb pairs] (dead link?)

[[Category:Resources by language|Japanese]]