[Corpora-List] syllable corpora
Simon G. J. Smith
smithsgj at eee.bham.ac.uk
Tue Sep 24 18:10:52 UTC 2002
In English, there is no absolute consensus on where syllable boundaries lie, so syllabic segmentation isn't trivial.
That's not necessarily true of all languages, though. In Chinese, for example, each syllable is represented by one character in the writing system; what is contentious there is where the *word* boundaries lie!
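To make the one-character-one-syllable point concrete, here is a minimal Python sketch. It assumes the text is already a Unicode string and that only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) count as syllables; the function name and the sample sentence are mine, purely for illustration.

```python
def split_syllables(text):
    """Return the Han characters in text, treating each one as one syllable.

    Simplifying assumption: only the basic CJK Unified Ideographs block
    (U+4E00..U+9FFF) is checked; punctuation, digits and Latin letters
    are skipped.
    """
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


sample = "我喜歡語料庫。"  # illustrative sentence, not taken from any corpus
syllables = split_syllables(sample)
print(syllables)
print(len(syllables))
```

The full stop at the end of the sample is dropped, because it lies outside the range being tested.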
So you might consider using a corpus of Chinese (for example, the CKIP corpus available from www.sinica.edu.tw ). I don't know whether you'll find anything in romanized form. If not, you may need to enlist the help of a Chinese speaker, download Chinese reading software from www.unionway.com , and run the Chinese characters through a Pinyin (romanization) annotator, such as http://www.all-day-breakfast.com/chinese/big5-simple.html.
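As a rough idea of what the romanization step involves, here is a small Python sketch. It is not the annotator linked above: the three-entry lookup table and the function name are hypothetical, and a real tool would need a full character-to-Pinyin dictionary plus some way of handling characters that have more than one reading. The sketch also assumes the corpus text arrives as Big5-encoded bytes, which is only a guess based on the name of the linked page.

```python
# Hypothetical mini-dictionary for illustration only; tone is given as a
# trailing digit. A real annotator needs a complete table.
PINYIN = {"語": "yu3", "料": "liao4", "庫": "ku4"}


def annotate(text, table=PINYIN):
    """Romanize each character found in table; leave the others unchanged."""
    return " ".join(table.get(ch, ch) for ch in text)


# Assumption: the source text is stored as Big5 bytes, so decode it first.
raw = "語料庫".encode("big5")
print(annotate(raw.decode("big5")))
```

Because each character maps to one syllable, the space-separated output is already a syllable-segmented romanization.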
Let me know how you get on if you try this.