Corpora: Diacritics and "deviant" texts in corpora

Tue Apr 24 08:27:27 UTC 2001

At 11:15 22/4/01 -0500, Marco Antonio Esteves da Rocha <marcor at cce.ufsc.br>
wrote:
>I will assume that the questions I ask Doug are of general interest.
Well, these are general problems in preparing corpora for writing
systems that derive from old Indian scripts, which are read
as syllables rather than in linear, left-to-right progression.

>Now, could you be a little more precise about the meaning of a "zero-width
>vowel character" ?
  Many Southeast Asian (and Indian) writing systems put some
vowels over or under the consonant.  A tone mark (or other diacritic)
can go over that.  A computer font gives such characters zero width,
so they are drawn on the preceding consonant without advancing the
cursor.

>I assume this means that there is underlying code to make characters
>appear on screen and paper correctly, but that this code plays havoc with
>searches ?
  Not exactly.  Because the vowel or diacritic has zero width,
sequences like:

   consonant vowel tone-mark
   consonant tone-mark vowel
   consonant tone-mark tone-mark tone-mark tone-mark vowel

can all have exactly the same display (the repeated characters just
overwrite each other).  However, they are different character
strings, so a search for one of them will not match the others.
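
  To make that concrete, here is a minimal Python sketch.  It assumes
Thai as the example script and picks three code points purely for
illustration (KO KAI as the consonant, SARA I as the above-vowel,
MAI EK as the tone mark); the variable names are made up for this
example.  Whether the three strings really look identical depends on
the font and renderer, but they never compare equal.

```python
# Illustrative Thai code points (an assumption; any script of this type would do).
CONSONANT = "\u0e01"  # THAI CHARACTER KO KAI
VOWEL = "\u0e34"      # THAI CHARACTER SARA I, written above the consonant
TONE = "\u0e48"       # THAI CHARACTER MAI EK, a tone mark

a = CONSONANT + VOWEL + TONE      # consonant vowel tone-mark
b = CONSONANT + TONE + VOWEL      # consonant tone-mark vowel
c = CONSONANT + TONE * 4 + VOWEL  # consonant + 4 tone-marks + vowel

print(len(a), len(b), len(c))  # 3 3 6
print(a == b, a == c)          # False False

# A plain substring search for one spelling misses the others.
corpus_line = "xx" + b
print(a in corpus_line)        # False
```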

  A language-aware system enforces an interchange standard
that is usually something like this:

 - a diacritic must be used in conjunction with a consonant,
 - no more than one over- or under-vowel per consonant,
 - no more than one tone mark, etc., per consonant,
 - the sequence vowel -> tone-mark is legal, but tone-mark -> vowel
   is not.

These are the practices people generally follow when they write
with a pencil.
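
  The four rules above could be checked with something like the
following sketch.  The character classes are a simplified subset of
Thai chosen for illustration, not a complete description of the
script, and the names CONSONANTS, VOWELS, TONES and is_valid are
hypothetical, not taken from any real standard or library.

```python
# Simplified, illustrative character classes (assumption: Thai subset only).
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}            # KO KAI .. HO NOKHUK
VOWELS = {"\u0e31"} | {chr(c) for c in range(0x0E34, 0x0E3A)}   # above/below vowels
TONES = {chr(c) for c in range(0x0E48, 0x0E4C)}                 # MAI EK .. MAI CHATTAWA


def is_valid(text):
    """Return True if every zero-width mark in text obeys the four rules."""
    have_base = False   # is the current cluster headed by a consonant?
    vowels = tones = 0  # marks seen so far on the current consonant
    for ch in text:
        if ch in VOWELS:
            # Needs a consonant, at most one vowel, and no tone mark before it.
            if not have_base or vowels or tones:
                return False
            vowels += 1
        elif ch in TONES:
            # Needs a consonant and at most one tone mark.
            if not have_base or tones:
                return False
            tones += 1
        else:
            # Any other character starts a new cluster.
            have_base = ch in CONSONANTS
            vowels = tones = 0
    return True


print(is_valid("\u0e01\u0e34\u0e48"))  # True:  consonant vowel tone-mark
print(is_valid("\u0e01\u0e48\u0e34"))  # False: tone-mark before vowel
```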
  As a rule, the input application - not the display app - enforces
this part of the interchange standard.  So the task in preparing a
corpus is, in effect, to simulate re-input, so that the saved text
obeys the interchange standard.
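
  One way to simulate re-input is sketched below, under stated
assumptions.  It uses the same illustrative Thai subset as character
classes, and the function name normalize is hypothetical.  Two policy
choices here are assumptions, not part of any standard: when a
consonant carries several vowels or tone marks, only the first of
each is kept, and marks with no consonant before them are passed
through unchanged.  A real corpus pipeline would probably want to log
both cases for a human to review.

```python
# Simplified, illustrative character classes (assumption: Thai subset only).
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}
VOWELS = {"\u0e31"} | {chr(c) for c in range(0x0E34, 0x0E3A)}
TONES = {chr(c) for c in range(0x0E48, 0x0E4C)}


def normalize(text):
    """Rewrite each consonant cluster as consonant [vowel] [tone-mark]."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        out.append(ch)
        if ch not in CONSONANTS:
            continue  # orphaned marks and other characters are left as-is
        vowel = tone = ""
        # Consume every zero-width mark attached to this consonant.
        while i < len(text) and (text[i] in VOWELS or text[i] in TONES):
            mark = text[i]
            i += 1
            if mark in VOWELS:
                vowel = vowel or mark  # keep only the first vowel
            else:
                tone = tone or mark    # keep only the first tone mark
        out.append(vowel + tone)       # canonical order: vowel, then tone mark
    return "".join(out)


messy = "\u0e01\u0e48\u0e48\u0e48\u0e48\u0e34"  # consonant, 4 tone-marks, vowel
clean = "\u0e01\u0e34\u0e48"                    # consonant vowel tone-mark
print(normalize(messy) == clean)  # True
```

After this pass, a search for the canonical spelling finds every
variant that was reduced to it.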

  --Doug