Search results

Page title matches

Page text matches

  • VOA Corpus (small) This corpus is in the public domain
    168 bytes (27 words) - 04:46, 11 August 2015
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English ...://ucts.uniba.sk/aranea_about/ Araneum Hungaricum], Gigaword Hungarian web corpus
    814 bytes (103 words) - 08:44, 26 June 2016
  • ...tp://ucts.uniba.sk/aranea_about/ Araneum Hispanicum], Gigaword Spanish web corpus * [http://www.corpusdelespanol.org/ Corpus del Español] (website only)
    1 KB (155 words) - 05:40, 29 June 2020
  • ...p://ucts.uniba.sk/aranea_about/ Araneum Nederlandicum], Gigaword Dutch web corpus * [http://www.statmt.org/europarl Europarl corpus] - sentence-aligned with English
    893 bytes (114 words) - 20:04, 5 September 2019
  • * [http://corporavm.uni-koeln.de/colonia/ Colonia], corpus of historical Portuguese. * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
    955 bytes (127 words) - 05:09, 4 May 2020
  • *[http://wt.jrc.it/lt/Acquis/ ACQUIS COMMUNAUTAIRE Multilingual Corpus] ...sli.uvigo.es/CLUVI/ CLUVI Corpus (Galician-English-Spanish-French parallel corpus)]
    3 KB (480 words) - 10:26, 16 February 2021
  • * '''Name of Dataset:''' ABC Corpus. * '''Citation:''' If you use the ABC Corpus in your research, please include the following citation in any resulting pa
    1 KB (187 words) - 19:58, 24 June 2008
  • SUMTIME-METEO is a parallel corpus of naturally occurring weather forecast texts and the The corpus has 1045 parallel data-text units and is
    1 KB (197 words) - 15:46, 7 February 2009
  • ==Telugu POS tagger, Morph analyzer, Lemmatizer, Corpus== Keywords: Telugu, Part of Speech tagger, Lemmatizer, Morph Analyser, Corpus
    1 KB (135 words) - 09:55, 26 May 2014
  • * [http://www.tekstlab.uio.no/Bosnian/Corpus.html Oslo Corpus of Bosnian Texts] ...ona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,
    394 bytes (47 words) - 13:44, 26 April 2008
  • ...s", the Russian portion is 876 MB, the other languages in the multilingual corpus are: English/French/Spanish/Arabic/Chinese/German ...wmt15/translation-task.html#download WMT corpora], including the Yandex 1M corpus, News Commentary, and News Crawl
    2 KB (269 words) - 08:55, 17 June 2015
  • * [http://gtweb.uit.no/korp/ Corpus for North Sámi, South Sámi, parallel corpus North Sámi - Norwegian] ...torio.uit.no/freecorpus/orig/sme/ Original files + metadata for North Sámi corpus]
    1 KB (190 words) - 07:38, 16 August 2017
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English * [http://ucts.uniba.sk/aranea_about/ Araneum Slovacum], Gigaword Slovak web corpus
    794 bytes (102 words) - 13:28, 8 March 2015
  • *[http://americannationalcorpus.org/ American National Corpus (ANC)] ...://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus]
    5 KB (788 words) - 18:58, 2 September 2019
  • * [http://ucts.uniba.sk/aranea_about/ Araneum Sinicum], Gigaword Chinese web corpus ...icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
    2 KB (264 words) - 18:42, 2 September 2019
  • * [http://en.wikipedia.org/wiki/Corpus_linguistics Corpus Linguistics] * [http://en.wikipedia.org/wiki/Text_corpus Text Corpus]
    1 KB (163 words) - 08:26, 17 January 2007
  • ....2 mil. tokens synchronic (text from 1990 on), standard Croatian reference corpus; lemmatised and MSD-tagged with the Croatian MultText East tagset using hyb ...Language Corpus] (continuously growing (currently approx. 100 mil. tokens) corpus of Croatian covering various genres and time periods, using Philologic for
    2 KB (233 words) - 05:17, 25 June 2012
  • * [http://www.statmt.org/setimes/ Southeast European Times], sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
    1 KB (148 words) - 09:36, 26 May 2014
  • * [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, * [http://tscorpus.com/ TS Corpus] (PoSTagged Turkish Corpus. The corpus also presents morphological and lemma tags of the data. Consists of 491 Mil
    2 KB (251 words) - 08:40, 17 June 2015
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English * [http://nl.ijs.si/elan/ IJS - ELAN] Slovene-English Parallel Corpus
    1 KB (141 words) - 09:52, 26 May 2014
  • * [http://www.isv.cbs.dk/~mbk/treebank/ PAROLE Corpus (SGML format)] (GPL) * [http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus]
    1 KB (174 words) - 09:38, 26 May 2014
  • * [http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)]
    151 bytes (21 words) - 14:35, 26 April 2008
  • ...ww.panl10n.net/english/OutputsIndonesia2.htm 500,000 Word Bahasa Indonesia Corpus and Parallel English Translation] (A-NC-SA 3.0 licence) ...n.net/english/OutputsIndonesia2.htm 500,000 Word Bahasa Indonesia Parallel Corpus with Penn Treebank] (A-NC-SA 3.0 licence)
    1 KB (174 words) - 23:28, 14 November 2018
  • ==POS Tagger, Morphological Analyzer, Lemmatizer, Corpus== The tagger and its related files are distributed under GNU GPL license. Corpus is licensed.
    2 KB (295 words) - 09:18, 30 June 2014
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English ...p://www.ling.su.se/staff/sofia/suc/suc.html Stockholm Umeå Corpus] (Tagged Corpus, freely available for research purposes)
    1 KB (169 words) - 05:38, 29 June 2020
  • * '''Name of Dataset:''' MSF2 Corpus * '''Citation:''' If you use the MSF2 corpus in your research, please include the following citation in any resulting pa
    2 KB (224 words) - 05:01, 4 May 2020
  • * [http://sli.uvigo.es/CTG/ Technical Corpus of Galician (CTG)] * [http://sli.uvigo.es/CTAG/ POS-tagged Technical Corpus of Galician (CTAG)]
    2 KB (308 words) - 11:21, 4 August 2014
  • * [http://ucts.uniba.sk/aranea_about/ Araneum Polonicum], Gigaword Polish web corpus * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
    3 KB (459 words) - 13:22, 8 March 2015
  • *[http://devoted.to/corpora Corpus-based Linguists (site maintained by David Lee)] ...tanford.edu/links/statnlp.html Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources]
    2 KB (305 words) - 00:23, 13 February 2007
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English * [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,
    1 KB (131 words) - 09:50, 26 May 2014
  • * [http://ucts.uniba.sk/aranea_about/ Araneum Italicum], Gigaword Italian web corpus * [http://www.istc.cnr.it/material/database/colfis/ ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto]
    3 KB (456 words) - 15:24, 15 March 2019
  • * [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,
    291 bytes (32 words) - 13:06, 25 March 2010
  • * [http://www.statmt.org/wmt10/training-giga-fren.tar 10^9 French-English corpus] ...//ucts.uniba.sk/aranea_about/ Araneum Francogallicum], Gigaword French web corpus
    3 KB (389 words) - 08:57, 17 June 2015
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
    398 bytes (50 words) - 09:42, 26 May 2014
  • * '''Citation:''' If you use this corpus in your research, please include the following citation in any resulting pa * '''Description:''' 1,443 compound nouns extracted from the British National Corpus and annotated with semantic relations. For more information and pointers to
    1 KB (151 words) - 09:22, 24 October 2012
  • ...ona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,
    319 bytes (34 words) - 07:02, 20 September 2007
  • | Corpus-based, predictive | Corpus-based, distributional
    5 KB (590 words) - 02:05, 6 September 2020
  • ...s/compare_contexts_NMRs.pdf Learning noun-modifier semantic relations with corpus-based and Wordnet-based features]. In ''Proceedings of the 21st National Co ...ter D. and Michael L. Littman. (2005). [http://arxiv.org/abs/cs.LG/0508103 Corpus-based learning of analogies and semantic relations]. ''Machine Learning'',
    2 KB (197 words) - 17:57, 3 January 2007
  • The SumTime corpus is structured as a database, and presented in text (CSV) and MDB (Microsoft ...s] and [https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip SumTime corpus] instead.
    2 KB (353 words) - 14:09, 6 August 2020
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English ...p://www.statmt.org/wmt15/translation-task.html WMT News Crawl] monolingual corpus. Currently 14M tokens.
    2 KB (300 words) - 04:38, 29 June 2020
  • ...ed German-English phrase-aligned parallel corpus, a subset of the EuroParl corpus (4000 sentences for each language, the tool at least is LGPL) ...ttp://ucts.uniba.sk/aranea_about/ Araneum Germanicum], Gigaword German web corpus
    4 KB (575 words) - 02:10, 26 August 2016
  • * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
    425 bytes (49 words) - 21:18, 16 December 2015
  • * [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, ...skiTaggingSiKDD2005.pdf Learning PoS tagging from a tagged Macedonian text corpus]". ''Proceedings of SiKDD 2005 (Conference on Data Mining and Data Warehous
    2 KB (195 words) - 17:04, 7 October 2010
  • ...ona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,
    323 bytes (34 words) - 07:40, 8 January 2008
  • * [http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus]
    548 bytes (72 words) - 08:56, 17 June 2015
  • * '''Recall:''' percentage of named entities defined in the corpus that were found by the program * '''Training data:''' Train split of CONLL-2003 corpus
    3 KB (378 words) - 07:29, 12 July 2019
  • ...es/inlg2006specialsession/INLG-0626.pdf Evaluations of NLG Systems: Common Corpus and Tasks or Common Dimensions and Metrics?] ...s/inlg2006specialsession/INLG-0627.pdf Building a Semantically Transparent Corpus for the Generation of Referring Expressions.]
    3 KB (361 words) - 05:44, 8 February 2009
  • * [http://www.ninjal.ac.jp/english/products/bccwj/ Balanced Corpus of Contemporary Written Japanese (BCCWJ)] (subset is web searchable at Koto * [http://www.edrdg.org/projects/tanaka/tanakacorpus.html Tanaka Corpus] by Tanaka Yasuhito, edited by Jim Breen, under a CC-BY-SA 3.0 licence
    4 KB (558 words) - 20:40, 11 October 2017
  • *[http://www.ling.ohio-state.edu/~jonsafari/corpora VOA Persian Corpus 2003-2008] (public domain) *[https://www.clarin.si/repository/xmlui/handle/11356/1042 Orwell's 1984 Corpus in MULTEXT-EAST] (public domain)
    5 KB (619 words) - 09:58, 23 February 2016
  • ...n of datasets that contains spam messages, and ham messages from the Enron corpus. See [http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf this article] fo
    814 bytes (135 words) - 09:07, 19 November 2006
  • ==Kannada POS tagger, Morph analyzer, Corpus== [http://sivareddy.in/downloads Download]. [http://corpus.leeds.ac.uk/tools/ Alternate source]
    751 bytes (101 words) - 03:43, 24 November 2011
  • ...coverage parser for the English language. An evaluation with the [[SUSANNE corpus]] shows that MINIPAR achieves about 88% precision and 80% recall with respe
    737 bytes (99 words) - 11:58, 17 November 2006
  • * '''Citation:''' If you use the TempEval-3 Platinum corpus in your research, please include the following citation in any resulting pa ...and temporal relations by multiple experts and an adjudicator. This is the corpus used to rank participant systems in the TempEval-3 evaluation exercise. Ann
    2 KB (250 words) - 10:44, 23 April 2013
  • ...pusa.net/XXmendea/Konts_arrunta_fr.html XX century's Basque corpus] Basque corpus XX century * [http://www.ztcorpusa.net ZT corpus] Basque Corpus of Science and Technology
    5 KB (728 words) - 09:35, 26 May 2014
  • * July 15, 2011 Completion of corpus selection [TBC]
    622 bytes (71 words) - 10:26, 6 April 2011
  • '''Title:''' ''SeedLing: Building and Using a Seed corpus for the Human Language Project''<br> '''Note:''' Plaintext corpus for >1000 languages with python API<br>
    3 KB (403 words) - 07:46, 29 June 2014
  • * [http://www.degruyter.com/journals/cllt Corpus Linguistics and Linguistic Theory] * [http://www.degruyter.com/journals/cllt Corpus Linguistics and Linguistic Theory]
    7 KB (866 words) - 14:12, 11 November 2018
  • ....is/icelandic_treebank/Download IcePaHC] - the Icelandic Parsed Historical Corpus. 440000 words (12th-19th century texts, phrase structure + PoS + lemma anno
    885 bytes (102 words) - 01:09, 15 April 2011
  • * [http://optima.jrc.it/Acquis/ JRC-Acquis] parallel corpus, 20926909 words, Maltese sentence-aligned with 22 other languages. Public d
    730 bytes (100 words) - 15:40, 20 June 2011
  • ...ttp://www.statmt.org/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian,
    631 bytes (63 words) - 16:59, 7 October 2010
  • | We parsed this corpus using Minipar, extracted subject-predicate-object triples from the results,
    862 bytes (99 words) - 06:28, 22 December 2009
  • | Corpus-based | Corpus-based
    5 KB (687 words) - 11:23, 28 June 2015
  • ! Corpus, window size, vector size | 5B corpus (Araneum + Wikipedia + UkWac), window 3, 1000 dimensions
    4 KB (521 words) - 15:14, 25 January 2017
  • | corpus-based | corpus-based
    2 KB (276 words) - 12:42, 28 June 2015
  • ...nym dictionaries: an acronym dictionary constructed automatically from the corpus and a synonym dictionary that contains geographical terms.
    935 bytes (108 words) - 06:23, 27 September 2011
  • ...y viable given recent advances in NLP and machine learning technology, and corpus availability.
    923 bytes (128 words) - 05:15, 25 June 2012
  • ...ims.uni-stuttgart.de/projekte/TIGER/ Linguistic Interpretation of a German Corpus]† *[http://ysomeya.hp.infoseek.co.jp/ Online Business Letter Corpus KWIC Concordancer]†
    19 KB (2,777 words) - 03:00, 12 September 2019
  • | Corpus-based | Corpus-based
    9 KB (1,199 words) - 09:37, 16 June 2020
  • | Corpus-based | Corpus-based
    9 KB (1,170 words) - 10:03, 22 March 2017
  • * '''Training data:''' sections 2-21 of Wall Street Journal corpus * '''Testing data:''' section 23 of Wall Street Journal corpus
    3 KB (437 words) - 14:23, 28 October 2013
  • === Methodius Corpus === === The Wikipedia company corpus ===
    14 KB (1,963 words) - 22:03, 30 May 2021
  • ...ipants attempt to recognize those senses after tuning their systems with a corpus of training data. ===Corpus based===
    12 KB (1,892 words) - 05:21, 12 December 2014
  • | Corpus-based | Corpus-based
    8 KB (1,015 words) - 18:33, 15 September 2019
  • | Do19-corpus | Corpus-based
    7 KB (871 words) - 18:34, 15 September 2019
  • :: <li> A corpus-based algorithm that clusters similar words together. A polysemous word may :: <li> An evaluation of many different corpus-based measures of word similarity, using multiple-choice synonym questions.
    7 KB (992 words) - 05:12, 25 June 2012
  • ! Corpus and window size | 6B Google News corpus, window 10
    6 KB (743 words) - 15:14, 25 January 2017
  • * [http://www.nyu.edu/projects/bowman/multinli/ The MultiGenre NLI Corpus] (433k examples, used in the [https://repeval2017.github.io/shared/ RepEval ....stanford.edu/projects/snli The Stanford Natural Language Inference (SNLI) corpus], a 570k example manually-annotated TE dataset with accompanying leaderboar
    12 KB (1,782 words) - 09:31, 29 May 2017
  • ...com/bncweb/home.html BNCweb: A Web-Based Interface to the British National Corpus] *[http://pie.usna.edu/explorec.html Chargrams Database from British National Corpus]
    15 KB (2,184 words) - 01:55, 16 October 2013
  • * ADCR2020T001 (May 4, 2020): [[MSF2 The Portuguese/Spanish corpus of Multi-Sentence Fusion (Repository)]]
    1 KB (160 words) - 02:34, 4 May 2020
  • * [http://www.statmt.org/setimes/ Southeast European Times], sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, * [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
    3 KB (455 words) - 09:44, 26 May 2014
  • : corpora, web as corpus, data-intensive linguistics, linguistic annotation, Unicode
    2 KB (259 words) - 10:02, 21 June 2008
  • | Corpus-based | Corpus-based
    13 KB (1,581 words) - 18:32, 15 September 2019
  • * '''Recall:''' percentage of NPs defined in the corpus that were found by the chunking program * '''Training data:''' sections 15-18 of Wall Street Journal corpus (Ramshaw and Marcus)
    5 KB (635 words) - 12:26, 27 August 2015
  • | corpus-based
    2 KB (190 words) - 15:11, 25 January 2017
  • ...ontribution and add "(Repository)" to the name (e.g., "Wall Street Journal Corpus (Repository)"). Add a link to a new ACL Wiki entry in [[Resources by Date (
    3 KB (421 words) - 09:20, 10 June 2008
  • ...stributional similarity based features extracted from the English Gigaword corpus.
    2 KB (283 words) - 09:32, 18 September 2012
  • * [http://sgi.nu/enron/mailinglist.php Enron Email Corpus Mailing List]
    3 KB (417 words) - 16:22, 15 January 2021
  • ...y for discovering a set of inference rules (or paraphrases) by analyzing a corpus of natural language text.
    2 KB (352 words) - 07:43, 8 January 2008
  • ; [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97T12 DSO Corpus of Sense-Tagged English] ...uns and 70 verbs occurring in the Brown Corpus and ''Wall Street Journal'' corpus
    11 KB (1,457 words) - 05:42, 12 December 2014
  • * '''Recall:''' percentage of named entities defined in the corpus that were found by the program
    2 KB (319 words) - 07:51, 7 August 2007
  • |[http://cental.fltr.ucl.ac.be/wac3/ WAC 2007] || 3rd Web as Corpus Workshop 2007 || || Louvain-la-Neuve, Belgium || 15-16 September
    3 KB (368 words) - 09:49, 30 November 2007
  • ...ly evolve into a treebank as a linguistic resource, and an (un-annotated) corpus of non-fictional (mainly newspaper) and fictional texts
    2 KB (217 words) - 05:17, 25 June 2012
  • Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many
    2 KB (346 words) - 08:41, 19 December 2012
  • ...of the corpus-based algorithms of Statistical Semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexi * Turney, P.D., and Littman, M.L. (2005). Corpus-based learning of analogies and semantic relations. ''Machine Learning'', 6
    8 KB (1,095 words) - 13:36, 25 May 2010
  • ...s/compare_contexts_NMRs.pdf Learning noun-modifier semantic relations with corpus-based and Wordnet-based features]. In ''Proceedings of the 21st National Co ...ter D. and Michael L. Littman. (2005). [http://arxiv.org/abs/cs.LG/0508103 Corpus-based learning of analogies and semantic relations]. ''Machine Learning'',
    5 KB (581 words) - 05:16, 25 June 2012
  • .../www.phon.ucl.ac.uk/home/alex/project/tagging/icetag.htm The International Corpus of English (ICE) Tagset] ...tp://www.ling.gu.se/~lager/taglog.html A Logical Approach To Computational Corpus Linguistics]
    9 KB (1,343 words) - 15:39, 7 February 2009
  • The Sketch Engine is a web-based program which takes as its input a corpus of any language with an appropriate level of linguistic mark-up. The Sketch * '''the Concordancer''' A program which displays all occurrences from the corpus for a given query. The program is very powerful with a wide variety of quer
    27 KB (4,573 words) - 12:19, 11 May 2012
  • * CORPUS LINGUISTICS 2001
    3 KB (401 words) - 20:06, 4 November 2007
  • * [http://gmb.let.rug.nl Groningen Meaning Bank] - annotated corpus (based on CCG and DRT)
    2 KB (297 words) - 05:14, 25 June 2012
  • ...Role Labelling, Entity Recognition Tools, Similarity / Relatedness Tools, Corpus Readers, Related Libraries), Links.
    2 KB (306 words) - 03:42, 7 August 2013
