Corpora: Diacritics and "deviant" texts in corpora

Sun Apr 22 16:15:13 UTC 2001

I will assume that the questions I ask Doug are of general interest, in
the spirit of the corpus-linguistics principle of raw data availability
that Jem pointed out. I hope the assumption is not misguided. Most of the
questions probably stem from my ignorance of the languages in question
or of computing.

On Sun, 22 Apr 2001, Doug Cooper wrote:

> At 22:39 21/4/01 +0100, Jem Clear wrote:
> >I must urge Tadeusz Piotrowski **not** to standardize or
> >normalize Polish e-mails or news agency feeds when
> >adding them to his corpus.
>
> I agree entirely with Jem's point, with an exception related
> to zero-width characters (diacritics, vowels, etc.) in
> Southeast Asian languages like Thai.  I don't know if
> you have this problem in Europe.
>

Now, could you be a little more precise about the meaning of a "zero-width
vowel character"? I think I understand a "zero-width diacritic character"
because, if I got you right, that's what happens in Portuguese and
Spanish, but a zero-width vowel? Is it anything like a vowel character
that signals length? Or perhaps some form of composite vowel sound that
is not a diphthong?

> Around here, the entry application enforces a local interchange
> standard on the order of such characters (usually it's 'store
> as normally hand-written'; eg. vowels before tone-marks, and
> no more than one of each).
>

That means you may have a sequence of characters such as:

1. first a character that stands for a vowel
2. then a character that signals vowel length
3. then a character that marks tone

Is that the order used in normally hand-written text?

>   However, because the characters are zero-width, an input
> application that isn't aware -- which can result from using
> a keyboard manager with the standard international version
> of the OS --  will permit both wrong orders, and multiple,
> overwritten characters.  These appear correct on screen or
> paper, but can't be searched properly.
>

I assume this means that there is underlying code that makes the
characters appear correctly on screen and paper, but that this code plays
havoc with searches? Does "correct" mean "normally hand-written"? I have
trouble following your reasoning here, possibly out of ignorance, but
could you clarify the difference between correct and normally hand-written?
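
To check whether I have understood the search problem, here is a minimal
sketch in Python. The Unicode code points, the variable names and the three
example strings are my own assumptions, not anything Doug specified; I also
assume, as he says, that all three strings would look the same on screen.

```python
# Hypothetical illustration: one Thai consonant carrying a vowel sign and a
# tone mark, both of which are drawn above the consonant (zero-width).
KO_KAI = "\u0e01"  # THAI CHARACTER KO KAI (consonant)
SARA_I = "\u0e34"  # THAI CHARACTER SARA I (vowel sign above)
MAI_EK = "\u0e48"  # THAI CHARACTER MAI EK (tone mark above)

# Order "as normally hand-written": vowel before tone mark.
as_handwritten = KO_KAI + SARA_I + MAI_EK
# Same characters, tone mark typed before the vowel.
misordered = KO_KAI + MAI_EK + SARA_I
# Tone mark typed twice, the second overwriting the first on screen.
doubled = KO_KAI + SARA_I + MAI_EK + MAI_EK

corpus = [as_handwritten, misordered, doubled]
query = as_handwritten

# Exact matching compares stored code points, not what is displayed.
exact_hits = [i for i, word in enumerate(corpus) if word == query]
print(exact_hits)  # [0]
```

If this is right, only the first stored form is found by an exact match,
although a reader would take all three to be the same word.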

>   This is less of a headache for Thai, which has had an
> interchange standard for some time, than for Lao, Khmer,
> Burmese, etc.  In any case,  I clean up this kind of stuff
> (ie. multiple or misordered diacritics) in corpus building,
> but no more than to the point of making what the user
> can search match what he or she sees.
>

Do you mean you clean up multiple or misordered diacritics that appear
correctly on screen and paper? I'm afraid I'm lost here. As this is
potentially uninteresting to other members of the list, could you point
me to a site where there are explanations about this in English or
French?
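
For what it is worth, my guess at such a clean-up is the minimal Python
sketch below. Everything in it is an assumption on my part: the function
name tidy_marks is hypothetical, the sets VOWELS and TONES cover only a
simplified subset of the Thai marks in Unicode, and the rule (drop repeated
marks, then put vowels before tone marks) is only my reading of the
interchange convention quoted above, not Doug's actual procedure.

```python
import re

# Simplified, assumed subset of Thai marks written above or below a consonant.
VOWELS = "\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39"
TONES = "\u0e48\u0e49\u0e4a\u0e4b"

# A run of one or more consecutive marks from either set.
MARK_RUN = re.compile("[" + VOWELS + TONES + "]+")


def _tidy_run(match):
    """Drop repeated marks in one run, then order vowels before tones."""
    kept = []
    for ch in match.group(0):
        if ch not in kept:
            kept.append(ch)
    # Stable sort: vowels get key 0, tone marks key 1.
    kept.sort(key=lambda ch: 0 if ch in VOWELS else 1)
    return "".join(kept)


def tidy_marks(text):
    """Return text with every run of marks deduplicated and reordered."""
    return MARK_RUN.sub(_tidy_run, text)


target = "\u0e01\u0e34\u0e48"                      # consonant, vowel, tone
print(tidy_marks("\u0e01\u0e48\u0e34") == target)        # misordered input
print(tidy_marks("\u0e01\u0e34\u0e48\u0e48") == target)  # doubled tone mark
```

Text that contains none of these marks passes through unchanged, so the
stored form is altered only where the marks are repeated or out of order.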

Marco Rocha