Difference between revisions of "Corpora for English"

From ACL Wiki
m (Move *[http://www.grsampson.net/RSue.html SUSANNE Analytic Scheme] from Uncategorized resource to Resources for English, Corpora for English, Free and Downloadable)
 
(23 intermediate revisions by 8 users not shown)
Line 1: Line 1:
 
For languages other than English, see [[List of resources by language]].
 
 
<div class="usermessage">
 
* Please help us move non-English items below into the [[List of resources by language]].
 
</div>
 
 
==English==
 
 
<!-- Please keep this list in alphabetical order -->
 
  
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car]
+
===Free and Downloadable===
 
*[http://americannationalcorpus.org/ American National Corpus (ANC)]
 
*[http://americannationalcorpus.org/FirstRelease/ AMERICAN NATIONAL CORPUS FIRST RELEASE]
 
*[http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
 
*[http://homepage.mac.com/bncweb/ BNCweb a web-based interface to the British National Corpus]
 
*[http://devoted.to/corpora Bookmarks for Corpus-based Linguists]
 
*[http://info.ox.ac.uk/bnc/ British National Corpus (from Oxford University)]
 
*[http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
 
*[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)]
 
*[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
 
*[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 
 
*[http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor-debate transcripts, with support/oppose labels]
 
*[http://www.athel.com/corpdes.html Corpus of Spoken Professional English]
 
 
*[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus]
 
*[http://etext.lib.virginia.edu/ Electronic Text Center -- University of Virginia]
 
*[http://www.phon.ox.ac.uk/~esther/ivyweb/ English Intonation in the British Isles - The IViE Corpus]
 
 
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c English stop words (from SMART)]
 
*[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)]
+
*[http://gmb.let.rug.nl Groningen Meaning Bank] semantically annotated corpus
*[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus]
+
*[https://corpling.uis.georgetown.edu/gum/ GUM - Georgetown University Multilayer corpus], multiple parses, coreference, entities, sentence types and RST
*[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
+
*[https://www.gutenberg.org Project Gutenberg]
*[http://www.gutenberg.org/wiki/Main_Page Gutenberg]
+
*[http://www.ucl.ac.uk/english-usage/ice/avail.htm International Corpus of English]
 +
*[http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
 
*[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia]
 
*[http://nora.hd.uib.no/icame.html ICAME]
 
 
*[http://www.cs.fit.edu/~mmahoney/compression/text.html Large Text Compression Benchmark's 1G sample of Wikipedia]
 
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c List of English stopwords]
 
 
*[http://www.cs.cornell.edu/People/pabo/movie-review-data/ Movie Review Data]
 
*[http://www.cs.pitt.edu/mpqa/ Multi-Perspective Question Answering (MPQA)]
 
 
*[http://mwe.stanford.edu/resources/ Multiword Expression Resources]
 
*[http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus]
 
*[http://pie.usna.edu/ Phrases in English]
 
*[http://homepages.feis.herts.ac.uk/~comrcml/Lyon-thesis.ps Restricted English Corpus from Dr. Caroline Lyon for PhD]
 
*[http://www.sketchengine.co.uk/ Sketch Engine]
 
 
*[http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/susanne/0.html Susanne: Annotated American English Corpus]
 
*[http://clix.to/davidlee00 The BNC Index (for the BNCWorld Edition)]
+
*[http://www.grsampson.net/RSue.html SUSANNE Analytic Scheme]
 
*[http://www-users.york.ac.uk/~sp20/corpus.html The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English]
 
*[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm The Dialogue Diversity Corpus]
 
 
*[http://www.grsampson.net/LucyDoc.html The LUCY Corpus - Documentation]
 
 
*[http://www.cs.rochester.edu/research/cisd/resources/trains.html TRAINS Dialogue Corpus]
 
 +
*[http://ebiquity.umbc.edu/resource/html/id/351 UMBC Webbase Corpus]
 +
*[http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
 +
*[http://www.let.rug.nl/~bos/vpe/ VP Ellipsis corpus]
 +
* [http://www.statmt.org/wmt15/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl
 +
 +
===Proprietary or Require Prior Permission===
 +
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum], Gigaword English web corpus
 +
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum Asiaticum], Gigaword Asian English web corpus
 +
*[http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
 +
*[http://boston.lti.cs.cmu.edu/Data/clueweb09/ ClueWeb]
 +
*[http://www.athel.com/cpsa.html Corpus of Spoken Professional English]
 +
*[http://www.phon.ox.ac.uk/~esther/ivyweb/ English Intonation in the British Isles - The IViE Corpus]
 +
*[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)]
 +
*[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
 +
*[http://mpqa.cs.pitt.edu Multi-Perspective Question Answering (MPQA)]
 +
*[http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus]
 +
*[http://www.sketchengine.co.uk/ Sketch Engine]
 +
*[http://wacky.sslmit.unibo.it/ WaCky]
 
*[http://www.webcorp.org.uk/guide/ WebCorp]
 
  
==Galician==
 
<!-- Please keep this list in alphabetical order -->
 
*[http://sli.uvigo.es/CLUVI/ Linguistic Corpus of the University of Vigo (CLUVI)]
 
*[http://sli.uvigo.es/CTG/ Technical Corpus of Galician (CTG)]
 
*[http://www.ti.usc.es/TILG/ Tesouro informatizado da lingua galega (TILG)]
 
  
==German==
+
<!-- Dead links
<!-- Please keep this list in alphabetical order -->
+
*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
 
+
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car]
*[http://www.phonetik.uni-muenchen.de/Bas/BasKorporaeng.html Bavarian Archive for Speech Signals Corpora]
+
*[http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
*[http://corpora.ids-mannheim.de/~cosmas/ COSMAS II]
+
*[http://homepage.mac.com/bncweb/ BNCweb a web-based interface to the British National Corpus]
*[http://www.coli.uni-sb.de/sfb378/negra-corpus/negra-corpus.html NEGRA Corpus]
+
*[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)]
 
+
*[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
==Iranian==
+
*[http://computing.open.ac.uk/coda/data.html CODA Parallel Annotated Monologue-Dialogue Corpus]
<!-- Please keep this list in alphabetical order -->
+
*[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 
+
*[http://etext.lib.virginia.edu/ Electronic Text Center -- University of Virginia]
*[http://ece.ut.ac.ir/DBRG/Bijankhan/ Bijankhan corpus]
+
*[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus]
*[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S50 CALLFRIEND Farsi (speech)]
+
*[http://nora.hd.uib.no/icame.html ICAME]
*[http://ece.ut.ac.ir/dbrg/hamshahri/ Hamshahri corpus]
+
*[http://pie.usna.edu/ Phrases in English]
*[http://www.elda.org/catalogue/en/speech/S0112.html Persian speech database Farsdat]
+
*[http://homepages.feis.herts.ac.uk/~comrcml/Lyon-thesis.ps Restricted English Corpus from Dr. Caroline Lyon for PhD]
 
+
*[http://clix.to/davidlee00 The BNC Index (for the BNCWorld Edition)]
==Russian==
+
-->
<!-- Please keep this list in alphabetical order -->
 
 
 
*[http://bokrcorpora.narod.ru Bokr Russian Reference Corpus]
 
*[http://www.slav.helsinki.fi/hanco/index_en.html HANCO: The Helsinki annotated corpus of Russian texts]
 
*[http://www.sfb441.uni-tuebingen.de/b1/korpora.html Russian Corpora]
 
*[http://rykov-cl.narod.ru/r.html Russian Corpora]
 
*[http://lib.ru/ Russian Corpus Site]
 
*[http://www.ruscorpora.ru/ The Russian National Corpus]
 
*[http://www.philol.msu.ru/~lex/corpus/ Russian Newspaper Corpus]
 
*[http://schools.keldysh.ru/uvk1838/Sciper/volume2/langres/russiclr.htm Russicon Resources]
 
 
 
==Slovak==
 
<!-- Please keep this list in alphabetical order -->
 
 
 
*[http://korpus.juls.savba.sk/index.en.html Slovak National Corpus]
 
 
 
==Italian==
 
<!-- Please keep this list in alphabetical order -->
 
 
 
*[http://languageserver.uni-graz.at/badip/badip/20_corpusLip.php LIP - Lessico di frequenza dell'Italiano Parlato - Access via BADIP]
 
*[http://www.istc.cnr.it/material/database/colfis/ ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto]
 
*[http://corpus.cilta.unibo.it:8080/coris_ita.html Corpus di Italiano Scritto contemporaneo (CORIS/CODIS)]
 
*[http://tlio.ovi.cnr.it/TLIO/ Tesoro della lingua italiana delle origini (TLIO)]
 
  
 
==Link collections==
 
Line 99: Line 64:
 
*[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora]
 
 
*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
 
*[http://www.alphabit.net Isabella Chiari: Corpora, Software and Linguistic resources]
 
 
*[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL]
 
  
Line 105: Line 69:
 
<!-- Please keep this list in alphabetical order -->
 
  
 +
*[http://corpus-tools.org/annis/ ANNIS] - open source search tool for complex multilayer corpora
 
*[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words]
 
 
*[http://korpus.pl/index.php?page=poliqarp Poliqarp] - open source XML-aware indexer, search engine and concordancer
 
Line 110: Line 75:
 
*[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme]
 
  
==Uncategorized==
 
<!-- Please keep this list in alphabetical order -->
 
 
===Arabic===
 
*[http://www.ldc.upenn.edu/Catalog/LDC2001T55.html Arabic Newswire Part 1]
 
===Bosnian===
 
*[http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian Texts]
 
===Bulgarian===
 
*[http://www.hf.uio.no/easteur-orient/bulg/mat/ Corpus of spoken Bulgarian]
 
===Croatian===
 
*[http://riznica.ihjj.hr/en/ Croatian Language Corpus at the IHJJ]
 
===Czech===
 
*[http://ucnk.ff.cuni.cz/english/index.html Czech National Corpus]
 
===Danish===
 
*[http://korpus.dsl.dk/korpus2000/indgang.php Danish news corpus]
 
===English===
 
*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
 
*[http://www.cornelsen.de/international/ An Empirical Grammar of the English Verb System]
 
*[http://thetis.bl.uk/ BNC Online Service]
 
*[http://info.ox.ac.uk/bnc/ BRITISH NATIONAL CORPUS - WORLD EDITION]
 
===Finnish===
 
*[http://www.csc.fi/kielipankki/ Finnish text bank]
 
===French===
 
*[http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Français]
 
===German===
 
*[http://www.coli.uni-sb.de/sfb378/negra-corpus/ A Syntactically Annotated Corpus of German Newspaper Texts]
 
*[http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)]
 
===Haitian Creole===
 
*[http://hometown.aol.com/mit2haiti/Index4.html HAITIAN CREOLE ELECTRONIC TEXTS]
 
===Italian===
 
*[http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers]
 
===Japanese===
 
*[http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html list of Japanese transitive - intransitive verb pairs]
 
===Polish===
 
*[http://korpus.pl/en/ IPI PAN Polish Corpus]
 
===Romanian===
 
*[http://www.cs.unt.edu/~rada/downloads.html Romanian NLP]
 
===Sanskrit===
 
*[http://sanskritlibrary.org/ Sanskrit Library]
 
 
===Slovenian===
 
*[http://nl.ijs.si/elan/#corpus Slovene-English Parallel Corpus]
 
===Spanish===
 
*[http://www.corpusdelespanol.org/ Corpus del Español]
 
*[http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Española contemporánea: corpus oral peninsular]
 
===Swahili===
 
*[http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en Helsinki Corpus of Swahili (HCS)]
 
----
 
 
 
*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html 2000 NIST Speaker Recognition Evaluation Corpus]
 
*[http://ixa.si.ehu.es/Ixa/resources/sensecorpus A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)]
 
*[http://odur.let.rug.nl/~vannoord/trees/ Alpino Treebank]
 
*[http://www.aot.ru/search1.html AOT]
 
*[http://pioneer.chula.ac.th/~awirote/ling/corpuslst.htm Corpus Resources (Chulalongkorn University, Thailand)]
 
*[ftp://ftp.cs.cornell.edu/pub/smart/cran/ Cranfield collection]
 
*[http://corpus.rae.es/creanet.html CREA]
 
*[http://www.eat.rl.ac.uk/ Edinburgh Associative Thesaurus (EAT)]
 
*[http://www.hum.uva.nl/~ewn EuroWordNet]
 
*[http://rali.iro.umontreal.ca/ Hansards Corpus - Searchable]
 
*[http://www.hcrc.ed.ac.uk/maptask/ HCRC Map Task Corpus XML annotations]
 
*[http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/ ICOPOST]
 
*[http://www.ims.uni-stuttgart.de/projekte/TC.html IMS Corpus Toolbox, Univ. of Stuttgart]
 
*[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ IMS Corpus Workbench (CWB)]
 
*[http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm International Corpus of Learner English]
 
*[http://www.ipds.uni-kiel.de/links/datenmaterial.en.html Kiel University's Institute on Phonetics and Speech Processing]
 
*[http://www.nilc.icmc.usp.br/lacioweb Lacio Web Corpora]
 
*[http://www.vuw.ac.nz/llc/ LANGUAGE LEARNING CENTER - ACADEMIC CORPUS]
 
*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
 
*[ftp://ftp.cs.cornell.edu/pub/smart/med/ Medlars collection]
 
*[ftp://ftp.ox.ac.uk/pub/wordlists/ Miscellaneous Word Lists from Oxford University]
 
*[http://www.lpl.univ-aix.fr/projects/multext/ Multilingual Text Tools and Corpora]
 
*[http://www.census.gov/genealogy/names Name lists from US census]
 
*[http://www.di.fc.ul.pt/~ahb/nexing.htm Nexing Corpus]
 
*[http://www.cs.cmu.edu/web/books.html On-line books at CMU]
 
*[http://logos.uio.no/opus/ OPUS -- An Open Source Parallel Corpus]
 
*[http://elex.amu.edu.pl/~przemka/PICLE_search.php Polish subcorpus of the International Corpus of Learner English]
 
*[http://www.cirp.es/WXN/wxn/frames/proxectos.html Ramón Piñeiro Center for Research]
 
*[http://about.reuters.com/researchandstandards/corpus/ Reuters Corpus]
 
*[http://www.ldc.upenn.edu/Catalog/LDC2001S97.html Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio]
 
*[http://www.ldc.upenn.edu/Catalog/LDC2001S99.html Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio]
 
*[http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/doc/notes/corpora.txt Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)]
 
*[http://www.ucl.ac.uk/english-usage/ Survey of English Usage, University College, London]
 
*[http://www.icsi.berkeley.edu/real/stp/index.html Switchboard Transcription Project]
 
*[http://www.tractor.de/ TELRI Research Archive of Computational Tools and Resources]
 
*[http://childes.psy.cmu.edu/ The Childes Corpus - Children's language]
 
*[http://nora.hd.uib.no/index-e.html The CORPORA DataCenter (Norway)]
 
*[ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/ The Moby Corpus]
 
*[http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.htm The Sofie Treebank - A Parallel Treebank of North European Languages]
 
  
 
[[Category:Corpora|*]]
 

Latest revision as of 17:58, 2 September 2019
