Resources for Arabic
Morphology
Free software
- AraMorph - Perl - An Arabic morphological analyzer and part-of-speech tagger written in Perl (originally by Tim Buckwalter)
- AraMorph - Java - An Arabic morphological analyzer and part-of-speech tagger rewritten in Java for Lucene
- AraComLex - An open-source finite-state morphology for Modern Standard Arabic. The source files can be compiled with the open-source compiler foma or with Xerox xfst.
- UralicNLP - A Python library that provides morphological tagging, generation, lemmatization and disambiguation for many languages, including Arabic
Proprietary
WordNets
Free software
- http://compling.hss.ntu.edu.sg/omw/ Open Multilingual Wordnet, which includes an Arabic wordnet with links to all the other Open Multilingual Wordnets
Proprietary
- http://babelnet.org/ (available for download for "Non-Commercial" use)
Parsers
Free software
- Dan Bikel's implementation of the Collins parser.
- Arabic dictionaries for the Link Grammar parser, by Jon Dehdari. These require the AraMorph package listed above.
- ElixirFM (an online interface is available) is an implementation of Functional Arabic Morphology written in Haskell and Perl; its lexicon is a "re-processed" version of the Buckwalter analyser's lexicon.
- Sarf - Arabic Morphology System (all in Java)
Corpora
Proprietary
- Arabic Newswire Part 1, 76 million tokens, annotation: paragraphs
Diacritization
Free software
- hAraCat - A free tool for predicting vowels and other diacritics.
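To make the diacritization task concrete, the following minimal Python sketch strips the diacritics from vocalized text and pairs each base letter with the marks attached to it. This is one common way to build input/label pairs for a diacritic predictor. It is not taken from hAraCat; the function names are hypothetical, and the set of code points (U+064B to U+0652 plus U+0670) is an assumption that covers the usual short vowels, tanwin, shadda, sukun and dagger alif only.

```python
# Assumed diacritic inventory: tanwin, short vowels, shadda, sukun (U+064B..U+0652)
# plus the superscript (dagger) alif U+0670. Other marks are not handled.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text):
    """Return the text with all diacritics in ARABIC_DIACRITICS removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def letter_mark_pairs(text):
    """Split vocalized text into (base character, attached diacritics) pairs."""
    pairs = []
    for ch in text:
        if ch in ARABIC_DIACRITICS and pairs:
            # Attach the mark to the most recent base character.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            # Base characters (and a stray leading mark) start a new pair.
            pairs.append((ch, ""))
    return pairs


if __name__ == "__main__":
    vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully vocalized
    print(strip_diacritics(vocalized))
    print(letter_mark_pairs(vocalized))
```

A predictor would then be trained to recover the second element of each pair from the stripped text.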
Free/open licence corpora
- Meedan-Memory, Arabic-English TMX (sentence-aligned), ~467,000 words on the English side, Open Database Licence
- Quranic Arabic Corpus, 77,430 words of Quranic Arabic, with manually verified contextual POS, inflection, derivation; dependency grammar annotation is planned.
- Arabic NER corpora by Yassine Benajiba, 150,000+ words.
- UN parallel corpora
- HamleDT, harmonized dependency treebanks of many languages, common annotation style.
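Sentence-aligned TMX files such as the Meedan-Memory entry above can be read with a standard XML parser. The sketch below uses only the Python standard library and extracts Arabic-English segment pairs from TMX text. The function name, the sample document and the language codes "ar" and "en" are assumptions for illustration; the codes and layout of any particular distribution should be checked before use.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang="ar", tgt_lang="en"):
    """Return (source, target) segment pairs from the translation units of a TMX document."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, so "ar-EG" is treated as "ar".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Hypothetical one-unit sample, not taken from any real corpus.
SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="ar"><seg>\u0645\u0631\u062D\u0628\u0627</seg></tuv>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for arabic, english in read_tmx_pairs(SAMPLE):
        print(arabic, "|", english)
```

For a large file, `ET.iterparse` would avoid loading the whole document into memory; the in-memory version is kept here for brevity.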