[Corpora-List] syllable corpora

Simon G. J. Smith smithsgj at eee.bham.ac.uk
Tue Sep 24 18:10:52 UTC 2002


In English, there is no absolute consensus on where syllable boundaries lie, so syllabic segmentation isn't trivial.

That's not necessarily true of all languages, though; in Chinese, for example, each syllable is represented by one character in the writing system. What is contentious with this language is where the *word* boundaries lie!
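To illustrate the point, here is a minimal Python sketch that treats every Chinese character as one syllable. The function name split_syllables is made up for this example, and it assumes that only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) count; punctuation and Latin letters are skipped.

```python
def split_syllables(text):
    """Return one list entry per Chinese character, treating each as one syllable."""
    # Assumption: only the basic CJK Unified Ideographs block is counted;
    # punctuation, digits and Latin letters are ignored.
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


syllables = split_syllables("我喜欢语言学。")
print(len(syllables), syllables)
```

Segmenting into *words*, by contrast, cannot be done with a per-character rule like this.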

So you might consider using a corpus of Chinese (for example, the CKIP corpus available from www.sinica.edu.tw). I don't know if you'll find anything in romanized form, so you might need to enlist the help of a Chinese speaker, download Chinese reading software from www.unionway.com, and run the Chinese characters through a Pinyin (romanization) annotator, like http://www.all-day-breakfast.com/chinese/big5-simple.html.
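The annotation step might look roughly like the Python sketch below. It is not how the annotator linked above works; it only shows the idea under some assumptions: the corpus text is Big5-encoded, the TOY_PINYIN table and the annotate function are invented for this example, and each character gets a single reading. Many characters have more than one reading, which is one reason a Chinese speaker's help would be useful.

```python
# Toy lookup table (hypothetical); a real annotator would need a full
# character-to-Pinyin dictionary. Tone numbers follow the syllable.
TOY_PINYIN = {
    "中": "zhong1",
    "文": "wen2",
    "語": "yu3",
    "料": "liao4",
    "庫": "ku4",
}


def annotate(big5_bytes, table=TOY_PINYIN):
    """Decode Big5 bytes and return one romanized syllable per character."""
    text = big5_bytes.decode("big5")
    # Characters missing from the table are marked "?" rather than guessed.
    return [table.get(ch, "?") for ch in text if "\u4e00" <= ch <= "\u9fff"]


sample = "中文語料庫".encode("big5")  # stands in for bytes read from a corpus file
print(" ".join(annotate(sample)))
```

Each item in the returned list is one syllable, so the output can be counted or tabulated directly.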

Let me know how you get on if you try this.


