Difference between revisions of "Resources for Arabic"

Latest revision as of 04:36, 29 June 2020

AraMorph - Perl - An Arabic morphological analyzer and part-of-speech tagger written in Perl (originally by Tim Buckwalter)
AraMorph - Java - An Arabic morphological analyzer and part-of-speech tagger rewritten in Java for Lucene
AraComLex - An open-source finite-state morphology for Modern Standard Arabic. The source files can be compiled with the open-source compiler foma or with Xerox xfst.
UralicNLP - A Python library that provides morphological tagging, generation, lemmatization, and disambiguation for many languages, including Arabic

http://compling.hss.ntu.edu.sg/omw/ Hebrew Wordnet with links to all the other Open Multilingual Wordnets

Dan Bikel's implementation of the Collins parser.
Arabic dictionaries, by Jon Dehdari, for the Link-Grammar parser. These require the AraMorph stemming package, above.
ElixirFM (online interface here) is an implementation of Functional Arabic Morphology written in Haskell and Perl; its lexicon is a "re-processed" version of the Buckwalter analyser's lexicon.
Sarf - Arabic Morphology System (all in Java)

Meedan-Memory, Arabic-English TMX (sentence-aligned), ~467,000 words on the English side, Open Database Licence
Quranic Arabic Corpus, 77,430 words of Quranic Arabic, with manually verified contextual POS, inflection, derivation; dependency grammar annotation is planned.
Arabic NER corpora by Yassine Benajiba, 150,000+ words.
UN parallel corpora
HamleDT, harmonized dependency treebanks of many languages in a common annotation style.

@@ Line 28: / Line 28: @@
 ===Proprietary===
 *[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1], 76 million tokens, annotation: paragraphs
+==Diacritization==
+===Free software===
+*[https://github.com/mikahama/haracat hAraCat] a free tool for predicting vowels and other diacritics.
 ===Free/open licence===