Search results

Page title matches

Microsoft Research Paraphrase Corpus - RTE Users

547 bytes (59 words) - 09:58, 21 December 2009
MSF2 The Portuguese/Spanish corpus of Multi-Sentence Fusion (Repository)
* '''Name of Dataset:''' MSF2 Corpus * '''Citation:''' If you use the MSF2 corpus in your research, please include the following citation in any resulting pa

2 KB (224 words) - 05:01, 4 May 2020

Page text matches

Resources for Pashto
VOA Corpus (small) This corpus is in the public domain

168 bytes (27 words) - 04:46, 11 August 2015
Resources for Hungarian
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English ...://ucts.uniba.sk/aranea_about/ Araneum Hungaricum], Gigaword Hungarian web corpus

814 bytes (103 words) - 08:44, 26 June 2016
Resources for Spanish
...tp://ucts.uniba.sk/aranea_about/ Araneum Hispanicum], Gigaword Spanish web corpus * [http://www.corpusdelespanol.org/ Corpus del Español] (website only)

1 KB (155 words) - 05:40, 29 June 2020
Resources for Dutch
...p://ucts.uniba.sk/aranea_about/ Araneum Nederlandicum], Gigaword Dutch web corpus * [http://www.statmt.org/europarl Europarl corpus] - sentence-aligned with English

893 bytes (114 words) - 20:04, 5 September 2019
Resources for Portugese
* [http://corporavm.uni-koeln.de/colonia/ Colonia], corpus of historical Portuguese. * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English

955 bytes (127 words) - 05:09, 4 May 2020
Multilingual Corpora
*[http://wt.jrc.it/lt/Acquis/ ACQUIS COMMUNAUTAIRE Multilingual Corpus] ...sli.uvigo.es/CLUVI/ CLUVI Corpus (Galician-English-Spanish-French parallel corpus)]

3 KB (480 words) - 10:26, 16 February 2021
Template for Data (Repository)
* '''Name of Dataset:''' ABC Corpus. * '''Citation:''' If you use the ABC Corpus in your research, please include the following citation in any resulting pa

1 KB (187 words) - 19:58, 24 June 2008
SumTime-Meteo
SUMTIME-METEO is a parallel corpus of naturally occurring weather forecast texts and the ...The corpus has 1045 parallel data-text units and is

1 KB (197 words) - 15:46, 7 February 2009
Resources for Telugu
==Telugu POS tagger, Morph analyzer, Lemmatizer, Corpus== Keywords: Telugu, Part of Speech tagger, Lemmatizer, Morph Analyser, Corpus

1 KB (135 words) - 09:55, 26 May 2014
Resources for Bosnian
* [http://www.tekstlab.uio.no/Bosnian/Corpus.html Oslo Corpus of Bosnian Texts] ...ona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,

394 bytes (47 words) - 13:44, 26 April 2008
Resources for Russian
...s", the Russian portion is 876 MB, the other languages in the multilingual corpus are: English/French/Spanish/Arabic/Chinese/German ...wmt15/translation-task.html#download WMT corpora], including the Yandex 1M corpus, News Commentary, and News Crawl

2 KB (269 words) - 08:55, 17 June 2015
Resources for Sámi
* [http://gtweb.uit.no/korp/ Corpus for North Sámi, South Sámi, parallel corpus North Sámi - Norwegian] ...torio.uit.no/freecorpus/orig/sme/ Original files + metadata for North Sámi corpus]

1 KB (190 words) - 07:38, 16 August 2017
Resources for Slovak
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English * [http://ucts.uniba.sk/aranea_about/ Araneum Slovacum], Gigaword Slovak web corpus

794 bytes (102 words) - 13:28, 8 March 2015
Corpora for English
*[http://americannationalcorpus.org/ American National Corpus (ANC)] ...://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus]

5 KB (788 words) - 18:58, 2 September 2019
Resources for Chinese
* [http://ucts.uniba.sk/aranea_about/ Araneum Sinicum], Gigaword Chinese web corpus ...icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]

2 KB (264 words) - 18:42, 2 September 2019
Wikipedia articles
* [http://en.wikipedia.org/wiki/Corpus_linguistics Corpus Linguistics] * [http://en.wikipedia.org/wiki/Text_corpus Text Corpus]

1 KB (163 words) - 08:26, 17 January 2007
Resources for Croatian
....2 mil. tokens synchronic (text from 1990 on), standard Croatian reference corpus; lemmatised and MSD-tagged with the Croatian MultText East tagset using hyb ...Language Corpus] (continuously growing (currently approx. 100 mil. tokens) corpus of Croatian covering various genres and time periods, using Philologic for

2 KB (233 words) - 05:17, 25 June 2012
Resources for Bulgarian
* [http://www.statmt.org/setimes/ Southeast European Times], sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English

1 KB (148 words) - 09:36, 26 May 2014
Resources for Turkish
* [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, * [http://tscorpus.com/ TS Corpus] (PoSTagged Turkish Corpus. The corpus also presents morphological and lemma tags of the data. Consists of 491 Mil

2 KB (251 words) - 08:40, 17 June 2015
Resources for Slovenian
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English * [http://nl.ijs.si/elan/ IJS - ELAN] Slovene-English Parallel Corpus

1 KB (141 words) - 09:52, 26 May 2014
Resources for Danish
* [http://www.isv.cbs.dk/~mbk/treebank/ PAROLE Corpus (SGML format)] (GPL) * [http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus]

1 KB (174 words) - 09:38, 26 May 2014
Resources for Swahili
* [http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)]

151 bytes (21 words) - 14:35, 26 April 2008
Resources for Indonesian
...ww.panl10n.net/english/OutputsIndonesia2.htm 500,000 Word Bahasa Indonesia Corpus and Parallel English Translation] (A-NC-SA 3.0 licence) ...n.net/english/OutputsIndonesia2.htm 500,000 Word Bahasa Indonesia Parallel Corpus with Penn Treebank] (A-NC-SA 3.0 licence)

1 KB (174 words) - 23:28, 14 November 2018
Resources for Hindi
==POS Tagger, Morphological Analyzer, Lemmatizer, Corpus== The tagger and its related files are distributed under GNU GPL license. Corpus is licensed.

2 KB (295 words) - 09:18, 30 June 2014
Resources for Swedish
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English ...p://www.ling.su.se/staff/sofia/suc/suc.html Stockholm Umeå Corpus] (Tagged Corpus, freely available for research purposes)

1 KB (169 words) - 05:38, 29 June 2020
MSF2 The Portuguese/Spanish corpus of Multi-Sentence Fusion (Repository)
* '''Name of Dataset:''' MSF2 Corpus * '''Citation:''' If you use the MSF2 corpus in your research, please include the following citation in any resulting pa

2 KB (224 words) - 05:01, 4 May 2020
Resources for Galician
* [http://sli.uvigo.es/CTG/ Technical Corpus of Galician (CTG)] * [http://sli.uvigo.es/CTAG/ POS-tagged Technical Corpus of Galician (CTAG)]

2 KB (308 words) - 11:21, 4 August 2014
Resources for Polish
* [http://ucts.uniba.sk/aranea_about/ Araneum Polonicum], Gigaword Polish web corpus * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English

3 KB (459 words) - 13:22, 8 March 2015
Lists of resources
*[http://devoted.to/corpora Corpus-based Linguists (site maintained by David Lee)] ...tanford.edu/links/statnlp.html Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources]

2 KB (305 words) - 00:23, 13 February 2007
Resources for Romanian
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English * [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,

1 KB (131 words) - 09:50, 26 May 2014
Resources for Italian
* [http://ucts.uniba.sk/aranea_about/ Araneum Italicum], Gigaword Italian web corpus * [http://www.istc.cnr.it/material/database/colfis/ ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto]

3 KB (456 words) - 15:24, 15 March 2019
Resources for Serbian
* [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,

291 bytes (32 words) - 13:06, 25 March 2010
Resources for French
* [http://www.statmt.org/wmt10/training-giga-fren.tar 10^9 French-English corpus] ...//ucts.uniba.sk/aranea_about/ Araneum Francogallicum], Gigaword French web corpus

3 KB (389 words) - 08:57, 17 June 2015
Resources for Estonian
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English

398 bytes (50 words) - 09:42, 26 May 2014
1443 Semantically Annotated Compound Nouns (Repository)
* '''Citation:''' If you use this corpus in your research, please include the following citation in any resulting pa * '''Description:''' 1,443 compound nouns extracted from the British National Corpus and annotated with semantic relations. For more information and pointers to

1 KB (151 words) - 09:22, 24 October 2012
Resources for Montenegrin
...ona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,

319 bytes (34 words) - 07:02, 20 September 2007
MEN Test Collection (State of the art)
| Corpus-based, predictive | Corpus-based, distributional

5 KB (590 words) - 02:05, 6 September 2020
Citations of the Diverse Noun Compound Dataset
...s/compare_contexts_NMRs.pdf Learning noun-modifier semantic relations with corpus-based and Wordnet-based features]. In ''Proceedings of the 21st National Co ...ter D. and Michael L. Littman. (2005). [http://arxiv.org/abs/cs.LG/0508103 Corpus-based learning of analogies and semantic relations]. ''Machine Learning'',

2 KB (197 words) - 17:57, 3 January 2007
Data sets for NLG blog
The SumTime corpus is structured as a database, and presented in text (CSV) and MDB (Microsoft ...s] and [https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip SumTime corpus] instead.

2 KB (353 words) - 14:09, 6 August 2020
Resources for Finnish
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English ...p://www.statmt.org/wmt15/translation-task.html WMT News Crawl] monolingual corpus. Currently 14M tokens.

2 KB (300 words) - 04:38, 29 June 2020
Resources for German
...ed German-English phrase-aligned parallel corpus, a subset of the EuroParl corpus (4000 sentences for each language, the tool at least is LGPL) ...ttp://ucts.uniba.sk/aranea_about/ Araneum Germanicum], Gigaword German web corpus

4 KB (575 words) - 02:10, 26 August 2016
Resources for Lithuanian
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English

425 bytes (49 words) - 21:18, 16 December 2015
Resources for Macedonian
* [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, ...skiTaggingSiKDD2005.pdf Learning PoS tagging from a tagged Macedonian text corpus]". ''Proceedings of SiKDD 2005 (Conference on Data Mining and Data Warehous

2 KB (195 words) - 17:04, 7 October 2010
Resources for Serbo-Croatian
...ona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,

323 bytes (34 words) - 07:40, 8 January 2008
Resources for Czech
* [http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus]

548 bytes (72 words) - 08:56, 17 June 2015
CONLL-2003 (State of the art)
* '''Recall:''' percentage of named entities defined in the corpus that were found by the program * '''Training data:''' Train split of CONLL-2003 corpus

3 KB (378 words) - 07:29, 12 July 2019
Sharing data and evaluation (NLG)
...es/inlg2006specialsession/INLG-0626.pdf Evaluations of NLG Systems: Common Corpus and Tasks or Common Dimensions and Metrics?] ...s/inlg2006specialsession/INLG-0627.pdf Building a Semantically Transparent Corpus for the Generation of Referring Expressions.]

3 KB (361 words) - 05:44, 8 February 2009
Resources for Japanese
* [http://www.ninjal.ac.jp/english/products/bccwj/ Balanced Corpus of Contemporary Written Japanese (BCCWJ)] (subset is web searchable at Koto * [http://www.edrdg.org/projects/tanaka/tanakacorpus.html Tanaka Corpus] by Tanaka Yasuhito, edited by Jim Breen, under a CC-BY-SA 3.0 licence

4 KB (558 words) - 20:40, 11 October 2017
Resources for Persian
*[http://www.ling.ohio-state.edu/~jonsafari/corpora VOA Persian Corpus 2003-2008] (public domain) *[https://www.clarin.si/repository/xmlui/handle/11356/1042 Orwell's 1984 Corpus in MULTEXT-EAST] (public domain)

5 KB (619 words) - 09:58, 23 February 2016
Spam filtering datasets
...n of datasets that contains spam messages, and ham messages from the Enron corpus. See [http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf this article] fo

814 bytes (135 words) - 09:07, 19 November 2006
