Difference between revisions of "Resources for Finnish"
(Added: Araneum) |
(distinguish free vs. non-free corpora; +corpus link; etc.) |
||
Line 1: | Line 1: | ||
==Corpora== | ==Corpora== | ||
− | + | ===Free=== | |
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English | * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English | ||
+ | * [http://www.statmt.org/wmt15/translation-task.html WMT News Crawl] monolingual corpus. Currently 14M tokens. | ||
* [http://corpora.informatik.uni-leipzig.de/ Finnish plain text and Co-occurrences at LCC] | * [http://corpora.informatik.uni-leipzig.de/ Finnish plain text and Co-occurrences at LCC] | ||
− | |||
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style. | * [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style. | ||
+ | |||
+ | ===Non-Free=== | ||
+ | * [http://ucts.uniba.sk/aranea_about/ Araneum Finnicum], Gigaword Finnish web corpus | ||
+ | * [http://www.kielipankki.fi CSC Kielipankki] Language Bank at the [http://www.csc.fi/ CSC] Scientific Computing Centre, including some 200 million word tokens of Finnish texts. | ||
==Morphological analysers== | ==Morphological analysers== |
Revision as of 07:32, 17 June 2015
Corpora
Free
- Europarl corpus, sentence aligned with English
- WMT News Crawl monolingual corpus. Currently 14M tokens.
- Finnish plain text and Co-occurrences at LCC
- HamleDT, harmonized dependency treebanks of many languages in a common annotation style.
Non-Free
- Araneum Finnicum, Gigaword Finnish web corpus
- CSC Kielipankki Language Bank at the CSC Scientific Computing Centre, including some 200 million word tokens of Finnish texts.
Morphological analysers
Free software
- Omorfi is an Open Morphology for Finnish, in association with the Voikko speller project. See also https://kitwiki.csc.fi/twiki/bin/view/KitWiki/OmorfiHFSTVersion for installing with HFST. (LGPL/GPL)