[Corpora-List] Turkish Corpus

John Burger john at mitre.org
Fri Oct 30 15:10:49 UTC 2009


> You may want to have a look at the following two corpora available  
> from LDC:
>
> LDC2006S33 Middle East Technical University Turkish Microphone  
> Speech v 1.0
> (140 speakers reciting 40 sentences each)
>
> LDC94T5 ECI Multilingual Text
> (173K words; primarily journalistic text)

One could also construct a corpus from the Turkish Wikipedia dumps, in  
raw wikitext:

   http://download.wikimedia.org/trwiki/20091027/

and/or rendered HTML (but this is over a year old):

   http://static.wikipedia.org/downloads/2008-06/tr/

Depending on what and how you count, the resulting corpus is probably  
in excess of 50 million words.
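
To make "what and how you count" a bit more concrete, here is a rough Python sketch of one way to count words in the raw wikitext. It is only an illustration under several assumptions: the function names (strip_markup, count_words, count_dump) are made up for this example, the regular expressions are a crude approximation of wikitext rather than a real parser, and a "word" is simply a run of Unicode word characters, so an apostrophe splits a token in two. Different choices will give noticeably different totals.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Crude patterns, assumed good enough for a rough count (not a wikitext parser).
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")               # innermost {{...}} template
TAG = re.compile(r"<[^>]+>")                           # HTML-like tags such as <ref>
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")   # [[target|label]] -> label
WORD = re.compile(r"\w+")                              # run of Unicode word characters


def strip_markup(wikitext):
    """Remove the most common wiki markup, keeping the visible text."""
    # Repeat until stable so nested templates are peeled from the inside out.
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE.sub(" ", wikitext)
    wikitext = TAG.sub(" ", wikitext)       # drops the tags, keeps their contents
    wikitext = LINK.sub(r"\1", wikitext)    # keep the link label only
    return wikitext.replace("'''", "").replace("''", "")  # bold/italic quotes


def count_words(wikitext):
    """Count word tokens in one page of wikitext after stripping markup."""
    return len(WORD.findall(strip_markup(wikitext)))


def count_dump(path):
    """Sum count_words over every <text> element of a bz2-compressed XML dump."""
    total = 0
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            # Element names carry an XML namespace, so compare the local name.
            if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
                total += count_words(elem.text)
            elem.clear()  # release the text we have already counted
    return total
```

For instance, count_words("'''Ankara''', [[Türkiye]]'nin [[başkent|başkenti]]dir.") gives 4 under these rules, because "Türkiye'nin" is split at the apostrophe. count_dump would be pointed at a locally downloaded pages file from the dump directory above; the exact file name is not given here, so that part is left to the reader.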

- John D. Burger
   MITRE




