Corpora: Diacritics and "deviant" texts in corpora

Sun Apr 22 12:07:10 UTC 2001

At 22:39 21/4/01 +0100, Jem Clear wrote:
>I must urge Tadeusz Piotrowski **not** to standardize or
>normalize Polish e-mails or news agency feeds when
>adding them to his corpus.

I agree entirely with Jem's point, with an exception related
to zero-width characters (diacritics, vowels, etc.) in
Southeast Asian languages like Thai.  I don't know if
you have this problem in Europe.

  Around here, the entry application enforces a local interchange
standard on the order of such characters (usually it's 'store
as normally hand-written'; e.g. vowels before tone-marks, and
no more than one of each).

  However, because the characters are zero-width, an input
application that isn't aware of the standard -- which can
result from using a keyboard manager with the standard
international version of the OS -- will permit both wrong
orders and multiple, overwritten characters.  These appear
correct on screen or paper, but can't be searched properly.
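
  To illustrate the search problem, here is a minimal sketch,
assuming the text is held as Unicode code points (an assumption;
the message does not name an encoding).  The variable names and the
sample word are made up for the example.  It builds one Thai word
three ways: in the standard order, with vowel and tone mark swapped,
and with the tone mark typed twice.

```python
# Thai code points, spelled out so the stored order is visible.
KO_KAI = "\u0e01"    # consonant
SARA_I = "\u0e34"    # vowel written above the consonant
MAI_EK = "\u0e48"    # tone mark written above the vowel
NGO_NGU = "\u0e07"   # consonant

# Standard order: vowel before tone mark.
standard = KO_KAI + SARA_I + MAI_EK + NGO_NGU
# Same marks, typed in the wrong order.
misordered = KO_KAI + MAI_EK + SARA_I + NGO_NGU
# Tone mark typed twice; the second may simply overprint the first.
doubled = KO_KAI + SARA_I + MAI_EK + MAI_EK + NGO_NGU

# A corpus line that contains the misordered spelling.
corpus_line = "aaa " + misordered + " bbb"

# A user who types the standard spelling will not find it.
found = standard in corpus_line

# The two spellings hold the same characters, only in a different order.
same_characters = sorted(standard) == sorted(misordered)
```

Here `found` is False even though `same_characters` is True, which is
the mismatch between what is seen and what can be searched.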

  This is less of a headache for Thai, which has had an
interchange standard for some time, than for Lao, Khmer,
Burmese, etc.  In any case, I clean up this kind of stuff
(i.e. multiple or misordered diacritics) in corpus building,
but no more than to the point of making what the user
can search match what he or she sees.
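
  A clean-up pass of this kind might be sketched as follows.  This
is only an illustration under stated assumptions, not necessarily
the procedure used for any actual corpus: it assumes Unicode Thai
text, covers only a hand-picked subset of above/below vowels and
the four tone marks, and the names `VOWELS`, `TONES` and
`clean_marks` are hypothetical.  Within each run of marks it drops
exact repeats and puts vowels before tone marks, and it touches
nothing else.

```python
import re

# Assumed subset of Thai zero-width vowels (above or below the consonant).
VOWELS = "\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39"
# The four Thai tone marks.
TONES = "\u0e48\u0e49\u0e4a\u0e4b"

# A run is one or more of these marks in a row.
MARK_RUN = re.compile("[" + VOWELS + TONES + "]+")


def _clean_run(match):
    """Drop repeated marks, then order vowels before tone marks."""
    kept = []
    for ch in match.group(0):
        if ch not in kept:          # an exact repeat is an overprint
            kept.append(ch)
    # Stable sort: vowels (key False) come before tone marks (key True).
    kept.sort(key=lambda ch: ch in TONES)
    return "".join(kept)


def clean_marks(text):
    """Normalize every run of marks in text; leave all else untouched."""
    return MARK_RUN.sub(_clean_run, text)
```

For example, `clean_marks("\u0e01\u0e48\u0e34\u0e07")` returns the
same word with the vowel moved in front of the tone mark, and text
that is already in the standard order comes back unchanged.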

  -- Doug Cooper