Difference between revisions of "Resources for Chinese"
Sean Bethard (talk | contribs) m (Move * [http://pears.lib.ohio-state.edu/China/linguist.html Chinese Linguistics] (broken link) from Uncategorized resource to Resources for Chinese)
==Tools==
===Free software===
* [https://github.com/yzhang/rseg rseg] word segmentation; written in Ruby (no compilation, no hard dependencies apart from Ruby), comes with a model (MIT license)
* [https://code.google.com/p/ctbparser/ ctbparser] word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
* [http://www.cl.cam.ac.uk/~yz360/zpar.html ZPar] word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
* [http://code.google.com/p/duduplus/ DuDuPlus: a graph-based dependency parser for English and Chinese] ("Other Open Source" license?)
** where is the source code?

==Corpora==
===Free license===
* [http://corpora.heliohost.org/ HC Corpora] 1,606,811 lines of [http://en.wikipedia.org/wiki/Fair_use Fair Use] excerpts from news, blogs, Twitter
* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]

===Nonfree or Unknown license===
* [http://ucts.uniba.sk/aranea_about/ Araneum Sinicum], Gigaword Chinese web corpus
* [http://www.chinesecomputing.com Chinese Computing]
* [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People's Daily Corpus at ICL of Peking University]
* [http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html Frequency list of characters in the Internet corpus]
* [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
* [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]
* [http://corpus.leeds.ac.uk/query-zh.html A collection of Chinese corpora and frequency lists] Online query with three corpora
* [http://pears.lib.ohio-state.edu/China/linguist.html Chinese Linguistics]

[[Category:Resources by language|Chinese]]
Latest revision as of 17:42, 2 September 2019