Difference between revisions of "Resources for Chinese"

Latest revision as of 17:42, 2 September 2019

Tools

Free software

rseg: word segmentation; written in Ruby (no compilation, no hard dependencies apart from Ruby), comes with a model (MIT license)
ctbparser: word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
ZPar: word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
DuDuPlus: a graph-based dependency parser for English and Chinese ("Other Open Source" license?)
- where is the source code?

Corpora

Free license

HC Corpora: 1,606,811 lines of Fair Use excerpts from news, blogs, and Twitter
UN parallel corpora

Nonfree or Unknown license

Araneum Sinicum, Gigaword Chinese web corpus
Chinese Computing
Word Segmented and POS tagged People Daily Corpus at ICL of Peking University
Frequency list of characters in the Internet corpus
Frequency list of lexical items in the Internet corpus
Lancaster Corpus of Mandarin Chinese
A collection of Chinese corpora and frequency lists: online query with three corpora
Chinese Linguistics

@@ Line 1: / Line 1: @@
 ==Tools==
 ===Free software===
-* [https://github.com/yzhang/rseg rseg] word segmentation, in ruby (no compilation, no hard dependencies apart from ruby), comes with a model (MIT license)
+* [https://github.com/yzhang/rseg rseg] word segmentation; written in ruby (no compilation, no hard dependencies apart from ruby), comes with a model (MIT license)
 * [https://code.google.com/p/ctbparser/ ctbparser] word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
 * [http://www.cl.cam.ac.uk/~yz360/zpar.html ZPar] word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
@@ Line 7: / Line 7: @@
 ** where is the source code?
-==Data==
+==Corpora==
-===Unknown license===
+===Free license===
+* [http://corpora.heliohost.org/ HC Corpora] 1606811 lines of [http://en.wikipedia.org/wiki/Fair_use Fair Use] excerpts from news, blogs, twitter
+* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
+===Nonfree or Unknown license===
+* [http://ucts.uniba.sk/aranea_about/ Araneum Sinicum], Gigaword Chinese web corpus
 * [http://www.chinesecomputing.com Chinese Computing]
 * [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
@@ Line 14: / Line 19: @@
 * [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
 * [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]
+* [http://corpus.leeds.ac.uk/query-zh.html A collection of Chinese corpora and frequency lists]  Online query with three corpora
+* [http://pears.lib.ohio-state.edu/China/linguist.html Chinese Linguistics]
 [[Category:Resources by language|Chinese]]
