Language Identification Tools
Latest revision as of 07:41, 19 December 2012
A listing of language identification tools. Language identification can mean identifying the text type (e.g. news vs literature) as well as the language (e.g. English vs Frisian vs Dutch).
Most of these tools require training on a large corpus (see List of resources by language for corpora per language), but many come with some prebuilt language models.
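To make "training on a corpus" and "language model" concrete, here is a minimal sketch in Python. It assumes a character-trigram profile with a rank-order ("out-of-place") comparison, which is one common approach to n-gram language guessing; it is not the code of any tool listed on this page. The function names, the `top_n` cutoff of 300 and the two toy corpora are all made up for illustration, and real models need far more training text.

```python
from collections import Counter


def trigram_profile(text, top_n=300):
    """Map the top_n most frequent character trigrams of text to their rank (0 = most frequent)."""
    # Normalise case and whitespace, and pad so word boundaries at both ends are counted.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by descending count; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; trigrams missing from the language model cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def train(corpora, top_n=300):
    """'Training' here just means building one trigram profile per language."""
    return {lang: trigram_profile(text, top_n) for lang, text in corpora.items()}


def identify(text, models, top_n=300):
    """Return the language whose profile is closest to the profile of text."""
    doc = trigram_profile(text, top_n)
    return min(models, key=lambda lang: out_of_place(doc, models[lang], top_n))


# Toy corpora, invented for this example only.
TOY_CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps "
          "in the house while the rain falls on the street",
    "nl": "de snelle bruine vos springt over de luie hond en de kat slaapt "
          "in het huis terwijl de regen op de straat valt",
}

models = train(TOY_CORPORA)
print(identify("the dog sleeps in the house", models))  # en
print(identify("de hond slaapt in het huis", models))   # nl
```

A prebuilt language model, in this sketch, would simply be the dictionary returned by `trigram_profile` saved to disk, so that users do not need the training corpus themselves.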
Free Software
- LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
  - Interfaces to the C library libtextcat:
    - http://www.jedi.be/pages/JTextCat/ – a Java interface to libtextcat
    - https://github.com/crodas/PHPTextCat/ – a PHP module for libtextcat
    - https://launchpad.net/pylibtextcat (Python 2) / https://github.com/bbqsrc/pylibtextcat/ (Python 3) – Python interfaces to libtextcat
  - http://odur.let.rug.nl/~vannoord/TextCat/ – original Perl TextCat implementation
    - http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – Perl version with more language models, encoding fixes
  - http://olivo.net/software/lc4j/ – a Java reimplementation
  - http://thomas.mangin.com//content/texcat-in-python.html – a Python implementation by Thomas Mangin
  - http://www.mnogosearch.org/guesser/ – another C reimplementation
- Languid/GuessLanguage, trigram based
  - http://languid.cantbedone.org/ (dead link) – original Perl version by Maciej Ceglowski
  - http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup – C++ version by Jacob R Rideout for KDE
  - https://bitbucket.org/spirit/guess_language – Python 3 version by Phi-Long Do, supports Python 2 via lib3to2
- Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
  - https://code.google.com/p/language-detection/ – source code, data for 53 languages
  - https://code.google.com/p/lang-guess/ – lang-guess is a fork of language-detection
- Compact Language Detector for JavaScript https://github.com/jaukia/cld-js (3-clause BSD license)
  - doesn't seem to include a method to add new languages; the existing ones were presumably generated by Google
- LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)
Proprietary
- Google Language Identification API
- Lingua-Systems lid http://www.lingua-systems.com/language-identifier/
See also
- Language Identification (State of the art)
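- English Wikipedia on Language detection https://en.wikipedia.org/wiki/Language_detection
- TextCat competitors http://www.let.rug.nl/~vannoord/TextCat/competitors.html – list compiled by Gertjan van Noord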