Language Identification Tools

From ACLWiki
(Difference between revisions)
(Free Software)
Line 4: Line 4:
  
 
  ==Free Software==
- * TextCat
+ * LibTextCat http://software.wise-guys.nl/libtextcat/ (BSD license)
- ** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat/LM - language models for the perl version
+ ** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat
- ** http://olivo.net/software/lc4j/ - a java implementation
+ ** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
+ ** http://olivo.net/software/lc4j/ a java implementation
+ ** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
  * Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier (Apache license)
  * Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)

Revision as of 04:52, 6 December 2012

A listing of language identification tools. Language identification can mean identifying either text type (e.g. news vs. literature) or language (e.g. English vs. Frisian vs. Dutch).

Most of these tools must be trained on a large corpus (see List of resources by language for corpora per language), but many ship with prebuilt language models.
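As a rough illustration of what "training" and a "language model" can mean for such tools, the sketch below builds character n-gram rank profiles and compares them with an out-of-place distance, the general approach described by Cavnar and Trenkle and associated with the TextCat family. It is not the API of any tool listed here: the function names, the toy training sentences, and the profile size are all assumptions made for illustration, and a usable model would need far more training text.

```python
from collections import Counter

# Toy "corpora" (assumed for illustration; real models need much more text).
TOY_CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "nl": "de snelle bruine vos springt over de luie hond en dan slaapt de hond in de zon",
}


def ngram_profile(text, n_max=3, top_k=300):
    """Return up to top_k character n-grams (n = 1..n_max), most frequent first."""
    # Mark word boundaries with "_" so word-initial and word-final n-grams are distinct.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the model get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def train(corpora):
    """'Training' here just means computing one rank profile per language."""
    return {lang: ngram_profile(text) for lang, text in corpora.items()}


def identify(text, models):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


MODELS = train(TOY_CORPORA)
```

Calling `identify("some text", MODELS)` returns one of the keys of `TOY_CORPORA`. With training data this small the answer for unseen text is unreliable, which is why the tools listed here depend on large corpora or ship with prebuilt models.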

Free Software

Proprietary

See also
