Difference between revisions of "Resources for Polish"

From ACL Wiki
(Added: Araneum)
 
(11 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
==Corpora==
 
 +
* [http://ucts.uniba.sk/aranea_about/ Araneum Polonicum], Gigaword Polish web corpus
 +
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
 
* [http://korpus.pl/en/ IPI PAN Corpus] - The IPI PAN Corpus is a large (currently over 250 million segments), morphosyntactically annotated, publicly available corpus of Polish, developed by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS).
 
 +
* [http://korpus.pwn.pl/index_en.php PWN Corpus] - PWN has prepared and made available an online version of the Corpus of Polish consisting of 40 million words. The samples were taken from 386 books, 977 editions selected from 185 different press publications, 84 transcribed spoken texts, 207 web sites and several hundred advertising leaflets and other ephemera. The full version of the corpus is available on payment for access, while a demonstration version of over 7.5 million words is available free of charge.
  
==Parsers==
+
==Taggers, parsers, morphology analysers==
 +
 
 +
==Free/Open Source Software==
 +
* [http://morfologik.blogspot.com/ Morfologik] -- morphological dictionary by Marcin Miłkowski (of LanguageTool), licensed under CC-SA / GNU LGPL
 +
** [http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Morfologik_converted Morfologik converted to the IKIPI tagset] (the tagset of the IPI PAN Corpus)
 +
* [http://nlp.pwr.wroc.pl/en/tools-and-resources/narzedzia-przetwarzania-morfosyntaktycznego Morphosyntactic Toolchain] by WrocUT Language Technology Group G4.19, licensed under GNU LGPL (some optional add-ons are GNU GPL). Command-line utilities providing tokenisation, morphological analysis, morphosyntactic tagging, shallow parsing (chunking), and WCCL feature vectors for machine learning.
 +
 
 +
==Unknown license==
 +
* [http://nlp.ipipan.waw.pl/~wolinski/morfeusz/ "Morfeusz"] - morphological analyser of Polish (Woliński, 2005)
 +
** [http://www.springerlink.com/content/l101v8823391j568/ main reference] Morfeusz — a Practical Tool for the Morphological Analysis of Polish
 +
* "AMOR" - morphological analyser of Polish (Joanna Rabiega, 2000)
 +
** [http://members.chello.pl/jrw/doc/jr_ma.pdf/ main reference] Podstawy lingwistyczne automatycznego analizatora morfologicznego AMOR
 +
* [http://duch.mimuw.edu.pl/~kszafran/index.php?option=com_docman&task=cat_view&gid=33&Itemid=43 "SAM"] - morphological analyser of Polish (Krzysztof Szafran, 1994)
 +
* [http://sourceforge.net/project/showfiles.php?group_id=166344 Morfologik] - Polish morphological analyzer based on current ispell dictionaries, with Java libraries that interface with it. The first completely open-source, comprehensive set of morphological tools for Polish. Intended for use in grammar correction tools (to be included in the future).
 
* [http://nlp.ipipan.waw.pl/Spejd/ Spejd - Shallow Parsing and Disambiguation Engine]  
 
* [http://nlp.ipipan.waw.pl/~wolinski/swigra/ Świgra] - a DCG Parser of Polish
+
* [http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml lemmatizer] - Dawid Weiss
* [http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml Dawid Weiss] - lemmatizer for Polish
 
  
 
==Lexical resources==
 
 
+
* [http://plwordnet.pwr.wroc.pl/wordnet/ plWordNet] - a lexico-semantic database of the Polish language.
 +
* [https://play.google.com/store/apps/details?id=com.pwr.plwordnet Mobile plWordNet] - free mobile application for plWordNet browsing.
  
 
==Bibliography==
 

Latest revision as of 12:22, 8 March 2015

Corpora

  • Araneum Polonicum, Gigaword Polish web corpus
  • Europarl corpus, sentence aligned with English
  • IPI PAN Corpus - The IPI PAN Corpus is a large (currently over 250 million segments), morphosyntactically annotated, publicly available corpus of Polish, developed by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS).
  • PWN Corpus - PWN has prepared and made available an online version of the Corpus of Polish consisting of 40 million words. The samples were taken from 386 books, 977 editions selected from 185 different press publications, 84 transcribed spoken texts, 207 web sites and several hundred advertising leaflets and other ephemera. The full version of the corpus is available on payment for access, while a demonstration version of over 7.5 million words is available free of charge.

Taggers, parsers, morphology analysers

Free/Open Source Software

  • Morfologik -- morphological dictionary by Marcin Miłkowski (of LanguageTool), licensed under CC-SA / GNU LGPL
  • Morphosyntactic Toolchain by WrocUT Language Technology Group G4.19, licensed under GNU LGPL (some optional add-ons are GNU GPL). Command-line utilities providing tokenisation, morphological analysis, morphosyntactic tagging, shallow parsing (chunking), and WCCL feature vectors for machine learning.

Unknown license

  • "Morfeusz" - morphological analyser of Polish (Woliński, 2005)
    • main reference Morfeusz — a Practical Tool for the Morphological Analysis of Polish
  • "AMOR" - morphological analyser of Polish (Joanna Rabiega, 2000)
    • main reference Podstawy lingwistyczne automatycznego analizatora morfologicznego AMOR
  • "SAM" - morphological analyser of Polish (Krzysztof Szafran, 1994)
  • Morfologik - Polish morphological analyzer based on current ispell dictionaries, with Java libraries that interface with it. The first completely open-source, comprehensive set of morphological tools for Polish. Intended for use in grammar correction tools (to be included in the future).
  • Spejd - Shallow Parsing and Disambiguation Engine
  • lemmatizer - Dawid Weiss

Lexical resources

  • plWordNet - a lexico-semantic database of the Polish language.
  • Mobile plWordNet - free mobile application for plWordNet browsing.

Bibliography

External links