Language Identification Tools
Revision as of 01:49, 6 December 2012
A listing of language identification tools. Language identification can mean identifying either the text type (e.g. news vs. literature) or the language (e.g. English vs. Frisian vs. Dutch) of a text.
Most of these tools must be trained on a large corpus (see List of resources by language for corpora per language), but many ship with some prebuilt language models.
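To illustrate what "training on a corpus" can mean here, the following is a minimal sketch of the character n-gram rank-profile approach that TextCat-style identifiers are based on. It is not the code of any tool listed on this page: the function names, the profile size of 300, and the two tiny sample corpora are assumptions made up for the example, and real language models are trained on far more text.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    ordered from most to least frequent."""
    # Normalise case and whitespace, then pad so word edges become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles; n-grams missing from
    the language profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def train(corpora):
    """Build one n-gram profile (the 'language model') per language."""
    return {lang: ngram_profile(text) for lang, text in corpora.items()}


def identify(text, models):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Hypothetical toy corpora, far too small for real use.
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "sleeps in the sun while the fox runs through the field",
    "nl": "de snelle bruine vos springt over de luie hond en daarna slaapt "
          "de hond in de zon terwijl de vos door het veld rent",
}
models = train(corpora)
print(identify("the dog runs over the field", models))
```

A prebuilt language model, in this scheme, is simply a stored profile per language, which is why tools can ship models without shipping the corpora they were trained on.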
Free Software
- TextCat
  - http://opus.lingfil.uu.se/tools/public/language_guesser/textcat/LM - language models for the Perl version
  - http://olivo.net/software/lc4j/ - a Java implementation
- Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier (Apache license)
- Compact Language Detector for JavaScript https://github.com/jaukia/cld-js (3-clause license)
  - does not seem to include a way to add new languages; the existing language models were presumably generated by Google
Proprietary
- Google Language Identification API
- Lingua-Systems lid http://www.lingua-systems.com/language-identifier/