Difference between revisions of "Resources for Japanese"

From ACL Wiki
 
(17 intermediate revisions by 6 users not shown)
Line 1: Line 1:
 +
There is a very good list at JAIST: [http://www.jaist.ac.jp/project/NLP_Portal/doc/LR/lr-cat-e.html Catalogue of Language Resources and Tools in Japan]
 +
 
==Corpora==
 
===Proprietary===
 
* [http://corpora.informatik.uni-leipzig.de/ Japanese plain text and Co-occurrences at LCC] (downloadable and web-searchable, but only for non-commercial use)
 +
* [http://www.ninjal.ac.jp/english/products/bccwj/ Balanced Corpus of Contemporary Written Japanese (BCCWJ)] (subset is web searchable at Kotonoha)
 +
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
  
 
===Free/Open Licence===
* [http://www.edrdg.org/projects/tanaka/tanakacorpus.html Tanaka Corpus] by Jim Breen, under a CC-BY-SA 3.0 licence
+
====Multilingual====
 +
* [http://www.edrdg.org/projects/tanaka/tanakacorpus.html Tanaka Corpus] by Tanaka Yasuhito, edited by Jim Breen, under a CC-BY-SA 3.0 licence
 +
** [http://tatoeba.org/eng/home Tatoeba] Updated version of the Tanaka Corpus;  ≈150,000 sentence pairs  (CC-BY)
 +
* [http://alaginrc.nict.go.jp/WikiCorpus/index_E.html Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles]  ≈500,000 pairs of manually-translated sentences (CC-BY 3.0)
 +
* [http://id.ndl.go.jp/auth/ndlsh National Diet Library Subject Headers]  Japanese Subject Headers, with paraphrases including English Translations ([http://id.ndl.go.jp/auth/docs/about-ndlsh#03 non-commercial attribution])
 +
* [http://www2.nict.go.jp/univ-com/multi_trans/member/mutiyama/ English-Japanese Translation Alignment Data]  aligned by [http://mastarpj.nict.go.jp/~mutiyama/ Masao Utiyama] (GFDL, CC-by-nc 1.0)
 +
* [http://nlpwww.nict.go.jp/wn-ja/index.en.html WordNet Definitions and Glosses]  ≈180,000 sentence/phrase pairs from the [http://nlpwww.nict.go.jp/wn-ja/index.en.html Japanese Wordnet] (WordNet license, similar to BSD)
 +
* [http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html#jsemcor Japanese Translation of SemCor] ≈14,000 sentences from the [http://nlpwww.nict.go.jp/wn-ja/index.en.html Japanese Wordnet], easily aligned to the [http://www.cse.unt.edu/~rada/downloads.html#semcor English source]  (WordNet license, similar to BSD)
 +
* [http://www.phontron.com/kftt/#alignments The Kyoto Free Translation Task (KFTT)] by Graham Neubig, 1,235 sentences of Japanese-English manually word-aligned
 +
* [http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JEC%20Basic%20Sentence%20Data JEC Basic Sentence Data] by Kyoto University: 5,304 basic Japanese sentences based on Kyoto University Case Frame Data, translated in Chinese and English
 +
 
 +
====Monolingual====
 +
* [http://www-lab25.kuee.kyoto-u.ac.jp/NLP_Portal/lr-cat-e.html#jp:knb_corpus Kyoto University and NTT Blog Corpus]
 +
* [http://www.edrdg.org/~jwb/compv/ Compilation of potential Japanese compound verbs] by Jim Breen. 64,776 verb collection with n-gram counts and dictionary references (CC-SA licence)
  
 
== Grammars ==
===Proprietary===
+
===Free/Open Licence===
* [http://wiki.delph-in.net/moin/JacyTop Jacy HPSG grammar]
+
* [http://wiki.delph-in.net/moin/JacyTop Jacy HPSG grammar] MIT Licence
* [[Generation grammars|KPML generation grammar]]
+
===Unknown licence===
 
+
* [[Generation grammars|KPML generation grammar]] (downloadable)
  
 
==Dictionaries==
 
===Free/Open Licence===
* [http://www.csse.monash.edu.au/~jwb/edict.html EDICT] Japanese-English dictionary, by Jim Breen, under a CC-BY-SA 3.0 licence
+
* [http://www.edrdg.org/jmdict/edict_doc.html JMdict/EDICT] Japanese-English and Japanese-Multilanguage dictionary in text and XML formats, by EDRDG (Electronic Dictionary R&D Group) - 170,000 entries, (CC-BY-SA 3.0 licence)
* [http://www.csse.monash.edu.au/~jwb/enamdict_doc.html ENAMDICT/JMnedict] proper name dictionary, by Jim Breen, under a CC-BY-SA 3.0 licence
+
* [http://www.edrdg.org/enamdict/enamdict_doc.html ENAMDICT/JMnedict] proper name dictionary in text and XML formats - 740,000 entries, by EDRDG (Electronic Dictionary R&D Group), (CC-BY-SA 3.0 licence)
 +
* [http://nlpwww.nict.go.jp/wn-ja/index.en.html Japanese version of WordNet] by NICT, (WordNet license, like BSD)
 +
* [http://www.edrdg.org/kanjidic/kanjidic.html Kanjidic]/[http://www.edrdg.org/kanjidic/kanjd2index.html Kanjidic2] Kanji dictionaries in text and XML formats covering about 13,000 characters, by EDRDG (Electronic Dictionary R&D Group), (CC-BY-SA 3.0 licence)
  
 
===Unknown licence===
* [http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html List of Japanese transitive/intransitive verb pairs] (dead link?)
+
* [http://www.sljfaq.org/afaq/jitadoushi.html List of Japanese transitive/intransitive verb pairs] [http://nihongo.monash.edu/ti_list.html earlier version]
  
 
[[Category:Resources by language|Japanese]]

Latest revision as of 20:40, 11 October 2017

There is a very good list at JAIST: Catalogue of Language Resources and Tools in Japan

Corpora

Proprietary

Free/Open Licence

Multilingual

Monolingual

Grammars

Free/Open Licence

Unknown licence

Dictionaries

Free/Open Licence

  • JMdict/EDICT: Japanese-English and Japanese-Multilanguage dictionary in text and XML formats, by EDRDG (Electronic Dictionary R&D Group); 170,000 entries (CC-BY-SA 3.0 licence)
  • ENAMDICT/JMnedict: proper name dictionary in text and XML formats, by EDRDG (Electronic Dictionary R&D Group); 740,000 entries (CC-BY-SA 3.0 licence)
  • Japanese version of WordNet, by NICT (WordNet license, similar to BSD)
  • Kanjidic/Kanjidic2: kanji dictionaries in text and XML formats covering about 13,000 characters, by EDRDG (Electronic Dictionary R&D Group) (CC-BY-SA 3.0 licence)

Unknown licence