[Corpora-List] WANTED: Thai word-segmented corpora

Sun Aug 18 09:15:41 UTC 2002

At 14:53 17/8/02 +0200, Petr Sojka wrote:
>I am looking for word-segmented corpora of Thai.
>So far, I've found only Orchid corpus, but it is too small for
>our machine language research.

Dear Petr:

I see (or get) queries like this often enough to motivate the
following general comment (apologies in advance if I've
jumped to an incorrect conclusion about your goals).  Thai
appears to attract programming interest because it uses:

  a) a non-segmented writing system that
  b) has lots of text in electronic form available, and
  c) uses a nice, straightforward, one-byte encoding, and yet is
  d) so foreign that segmenting problems are not obvious ;-).

  I also see a steady stream of papers titled 'Yet Another
Segmentor/Hyphenator/Syllabifier for Thai,' all of which
use data sets like the Orchid Corpus for both training and
testing, and which usually report 94-97% success.

  Folks getting into this area should be advised that:

  a) beyond the trivial cases (and despite what's taught in Thai
grammar schools ;-), there is no general agreement on how
written Thai should be segmented into words;

  b) corpora like Orchid tend to be skewed by their developers'
opinions on the subject, and/or to have been automatically
generated by systems that use similar corpora for training;

  c) thus, using them as gold standards won't teach very much.
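
  To make point (c) concrete, here is a minimal sketch of how a
reported score depends on the segmentation convention baked into
the gold standard.  Everything in it is a hypothetical illustration:
the Latin-letter strings stand in for Thai text, the two "gold"
segmentations stand in for two annotators who disagree on whether a
compound is one word or two, and word-level F1 is only one of
several metrics such papers might use.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_f1(predicted, gold):
    """Word-level F1: a predicted word counts only if both of its
    boundaries match a word in the gold segmentation."""
    pred_spans = word_spans(predicted)
    gold_spans = word_spans(gold)
    true_positives = len(pred_spans & gold_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical placeholder data: the same unsegmented string "abcdef".
system_output = ["ab", "cd", "ef"]   # what the segmenter produced
gold_split = ["ab", "cd", "ef"]      # annotator A: compound is two words
gold_joined = ["abcd", "ef"]         # annotator B: compound is one word

score_vs_split = word_f1(system_output, gold_split)
score_vs_joined = word_f1(system_output, gold_joined)

print(f"F1 against annotator A: {score_vs_split:.2f}")
print(f"F1 against annotator B: {score_vs_joined:.2f}")
```

  The system output is identical in both runs; only the annotator's
convention changes, and the score changes with it.  A system trained
and tested on one corpus is therefore being measured mainly on how
well it reproduces that corpus's convention.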

  IMHO, linguistic research that requires segmented Thai data
(and by implication Lao, Burmese, and Khmer) is likely to remain
in its present rut until the focus shifts to some form of phrase
bracketing, rather than segmentation.

  Good luck,
  Doug Cooper