[Corpora-List] Turkish Corpus
John Burger
john at mitre.org
Fri Oct 30 15:10:49 UTC 2009
> You may want to have a look at the following two corpora available
> from LDC:
>
> LDC2006S33 Middle East Technical University Turkish Microphone
> Speech v 1.0
> (140 speakers reciting 40 sentences each)
>
> LDC94T5 ECI Multilingual Text
> (173K words; primarily journalistic text)
One could also construct a corpus from the Turkish Wikipedia dumps, in
raw wikitext:
http://download.wikimedia.org/trwiki/20091027/
and/or as rendered HTML (though this snapshot is over a year old):
http://static.wikipedia.org/downloads/2008-06/tr/
Depending on what and how you count, this is probably in excess of 50
million words.
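
To make the "what and how you count" caveat concrete, here is a rough Python sketch of one way to count words in wikitext. It is only an illustration under stated assumptions: the regular expressions are a crude stand-in for a real wikitext parser (nested templates, tables and references are not handled), and the function name and its strip_markup flag are made up for this example. It takes any iterable of text lines, so it makes no assumption about how the dump files are named or laid out.

```python
import re

# Crude wikitext patterns. These are assumptions for illustration only,
# not a full parser: nested templates, tables and refs are not handled.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                   # {{...}}, non-nested
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")   # [[target|label]] -> label
TAG = re.compile(r"<[^>]+>")                               # XML/HTML tags
# A "word" here is a run of letters, optionally joined by apostrophes,
# so that suffixed proper nouns like Türkiye'nin count once.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_words(lines, strip_markup=True):
    """Count word tokens in an iterable of wikitext lines (hypothetical helper)."""
    total = 0
    for line in lines:
        if strip_markup:
            line = TEMPLATE.sub(" ", line)   # drop templates entirely
            line = LINK.sub(r"\1", line)     # keep only the visible link text
            line = TAG.sub(" ", line)        # drop tags, keep their content
        total += len(WORD.findall(line))
    return total


sample = ["[[Ankara|Başkent]] Türkiye'nin {{bilgi kutusu}} şehridir."]
print(count_words(sample))                      # markup stripped
print(count_words(sample, strip_markup=False))  # markup words included
```

Even on this one line the two settings disagree, since link targets and template names are counted as words when markup is left in, which is why any total for the full dump should be read as a rough estimate.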
- John D. Burger
MITRE
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora