Corpora: Diacritics, Unicode

Tadeusz Piotrowski tadpiotr at plusnet.pl
Sat Apr 21 09:27:20 UTC 2001


This thread seems to be getting boring for some people, but just a comment.
I would like to suggest that the weakening of the position of diacritics in
written language, as seen in the reluctance to use them in email, at least
in some languages, like Polish, comes also from the fact that they no longer
reflect contemporary speech. The Polish 'nasal' vowels (indicated by
diacritics on 'a' and 'e') in fact no longer exist; children have to be
taught to use the relevant graphemes, because they pronounce other vowels or
other phoneme sequences than those indicated by the characters. This is of
course a common problem, but it shows where the wish to get rid of all
diacritics originates.
What is more interesting, I feel, is what we do with e-mail texts in corpus
building. Should those diacritic-less texts be treated as deviant and,
consequently, standardized/normalized, or should the lack of diacritics be
treated as a distinctive feature of this particular type of text?
Worse still, a free news bulletin is sent by the Polish Press Agency to all
interested parties, and there was a long period when it was diacritic-less.
The bulletin is nice and has lots of interesting words. Again: is it deviant
until the diacritics are inserted? Or is their absence an interesting
feature of this text?
And so on and so on.
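One way to keep both options open when building the corpus is to leave the
text untouched and record the absence of diacritics as metadata. The Python
below is only a rough sketch of that idea, not an established procedure: the
function names, the letter set and the 1% threshold are all assumptions made
for illustration.

```python
import unicodedata

# Lower-case Polish letters that carry diacritics (assumed set for this sketch).
POLISH_DIACRITIC_LETTERS = set("ąćęłńóśźż")


def strip_polish_diacritics(text: str) -> str:
    """Approximate how a text looks when typed without diacritics."""
    # 'ł' has no canonical Unicode decomposition, so it is mapped explicitly.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (ogonek, acute accent, dot above).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def diacritic_ratio(text: str) -> float:
    """Share of letters in the text that are Polish diacritic letters."""
    # Compose first so that decomposed input is counted the same way.
    composed = unicodedata.normalize("NFC", text).lower()
    letters = [ch for ch in composed if ch.isalpha()]
    if not letters:
        return 0.0
    marked = sum(1 for ch in letters if ch in POLISH_DIACRITIC_LETTERS)
    return marked / len(letters)


def annotate(text: str, threshold: float = 0.01) -> dict:
    """Keep the text as it is and flag it, instead of normalizing it."""
    ratio = diacritic_ratio(text)
    return {
        "text": text,
        "diacritic_ratio": ratio,
        # The threshold is arbitrary; a short text may lack diacritics by chance.
        "diacritics_absent": ratio < threshold,
    }
```

With a flag like this, the decision whether to treat such texts as deviant or
as a text type of their own can be postponed until the corpus is queried.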
As for Unicode: Tony McEnery has shown that it does not cope satisfactorily
with a number of languages of India (with non-Latin alphabets).
Regards

Tadeusz Piotrowski
***************************************************************
                                              mailing address
Department of English
Opole University                    Chrobrego 20
Oleska 48                              PL-55-020 Zorawina (Zórawina)
Opole
POLAND
              phone/fax (+48)71-3165847
              mobile (+48)607159263


