Resources for Arabic
Latest revision as of 04:36, 29 June 2020
Morphology
Free software
- AraMorph - Perl - An Arabic morphological analyzer and part-of-speech tagger written in Perl (originally by Tim Buckwalter)
- AraMorph - Java - An Arabic morphological analyzer and part-of-speech tagger rewritten in Java for Lucene
- AraComLex - An open-source finite-state morphology for Modern Standard Arabic. The source files can be compiled with either the open-source compiler foma or Xerox xfst.
- UralicNLP - A Python library that provides morphological tagging, generation, lemmatization and disambiguation for many languages, including Arabic
Proprietary
- Xerox Arabic Morphological Analyzer and Generator
WordNets
Free software
- http://compling.hss.ntu.edu.sg/omw/ Open Multilingual Wordnet, with links to wordnets for many languages, including Arabic
Proprietary
- http://babelnet.org/ (available for download for "Non-Commercial" use)
Parsers
Free software
- Bikel's implementation of Collins Parser by Dan Bikel.
- Arabic dictionaries, by Jon Dehdari, for the Link-Grammar parser. These require the AraMorph package listed under Morphology.
- ElixirFM (an online interface is also available) is a Functional Arabic Morphology written in Haskell and Perl; the lexicon is a "re-processed" version of the Buckwalter analyser.
- Sarf - Arabic Morphology System (all in Java)
Corpora
Proprietary
- Arabic Newswire Part 1, 76 million tokens, annotation: paragraphs
Free/open licence
- Meedan-Memory, Arabic-English TMX (sentence-aligned), ~467,000 words on the English side, Open Database Licence
- Quranic Arabic Corpus, 77,430 words of Quranic Arabic, with manually verified contextual POS, inflection, derivation; dependency grammar annotation is planned.
- Arabic NER corpora by Yassine Benajiba, 150,000+ words.
- UN parallel corpora
- HamleDT, harmonized dependency treebanks of many languages in a common annotation style.
Diacritization
Free software
- hAraCat - A free tool for predicting vowels and other diacritics.