Corpora: Diacritics

Chris Gledhill cjg6 at st-andrews.ac.uk
Fri Apr 20 13:12:34 UTC 2001


Dear Corporists,

As I absently followed this discussion, my mind fished out a recollection of
the UNIbet system. As I remember, UNIbet used standard ASCII characters to
represent the main IPA symbols. Does anybody use it nowadays in corpus work?
At the time I discovered it, it came in as a useful way of getting IPA into
documents without fiddling with fonts or diacritics. Although I use
diacritics wherever I can, I believe that they will ultimately fall foul of
the QWERTY principle (AZERTY, in France) - that is, an arbitrary standard
(the QWERTY keyboard layout) becomes entrenched and later innovations simply
cannot remove it.
Anyway, I have no idea what happened to UNIbet, but I do know that it and
many other debates, reviews and articles on writing systems and standards of
literacy have been documented over the years in the peer-reviewed Simplified
Spelling Journal (founded by George Bernard Shaw - the previous editor was
Chris Upward of Aston University's German department). Worth looking at,
perhaps?

Best regards
Chris Gledhill (St Andrews, UK)



I was pleasantly surprised to see the discussion on diacritics revived.
Herewith a few comments:

*  It is sad to think there are linguists who are intent on eradicating
diacritics as so many flyspecks from various orthographies, however
motivated this may be by myopic software that can't deal with anything
beyond ASCII 127.

* I have always had a soft spot for ornate alphabets and elegant
syllabaries, such as Georgian and Thai, and unabashedly
extend this aesthetic to the lowliest diacritic. The credit here goes
to the Jesuits for their strict training in Greek breathings, iota
subscripts and accents in my youth.

*  Aversions to diacritics seem ominously connected with a lack of
the same in English. It would be therapeutic if the linguistic
(imperialist) tables were turned and speakers of Hebrew or Arabic began
wondering out loud why we clutter English texts with all those vowels.
Then again, I suspect the uniform structure of roots in those languages
would make such a recommendation as relativistic as the Anglocentric
admonitions at issue on the list. My feeling is that we'll be ready for
orthographic hygiene across languages as soon as the dispute over variant
Klingon orthographies has been resolved
(http://www.kli.org/tlh/sounds.html) - by the Klingons.

* For some interesting political or linguistic reason, German legal
texts in the EU (EUR-LEX database) use the digraphs ae, ue and oe
instead of a-, u- and o-umlaut. This is surprising inasmuch as umlauts
are reasonably "establishment" as diacritics go and none of the other
official languages seem to have adopted a comparable practice. The only
other forum where I have seen this convention is Eurosport, where one
sees Finnish surnames like Hämäläinen or Määttä rendered as
Haemaelaeinen, Maeaettae.

* Arguing that natives sometimes leave out diacritics and that the
latter are therefore probably dispensable strikes me as tantamount to
studying telegrams in English and concluding that the language could get
by without articles and prepositions.

*  With EU enlargement to include the Czech Republic, Estonia, Hungary,
Poland and Slovenia, the minds behind (and in front of) computers in
general and email programs in particular had better quit while they are
behind and learn to deal with diacritics.

* I once wrote a little program on a computer course that would take a
Finnish text and replace the double (i.e. long) consonants and vowels
with the corresponding single character and an acute accent (e.g.,
kaataa 'to pour' -> kátá) à la Hungarian vowel orthography. (Not
surprisingly, I had to design a new character set to get consonants with
the appropriate diacritic.) Comparisons of input and output texts
revealed that such a reform would cut paper consumption by 10-15%.
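
For readers who want to try the idea in the last point, here is a
minimal sketch in Python. It is not the original program, which is not
shown in this message. The function names shorten_doubles,
visible_length and saving are made up for illustration, and the sketch
assumes that "double" simply means the same letter written twice in a
row. Instead of a custom character set it uses the Unicode combining
acute accent (U+0301), so that any letter can carry the mark.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # Unicode combining acute accent


def shorten_doubles(text: str) -> str:
    """Replace each doubled letter with one letter plus an acute accent.

    Assumption: a "long" sound is written as the same letter twice in a row.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isalpha() and i + 1 < len(text) and text[i + 1] == ch:
            # Collapse the pair into a single accented letter.
            out.append(ch + COMBINING_ACUTE)
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC composes letter + accent into one code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


def visible_length(text: str) -> int:
    """Count characters that take up horizontal space (skip combining marks)."""
    return sum(1 for c in text if not unicodedata.combining(c))


def saving(text: str) -> float:
    """Fraction of visible characters saved by the rewrite (0.0 for empty text)."""
    before = visible_length(text)
    if before == 0:
        return 0.0
    return 1 - visible_length(shorten_doubles(text)) / before


print(shorten_doubles("kaataa"))  # kátá
```

Running saving() over a longer Finnish text gives a rough character-count
figure to set beside the paper estimate above. It ignores line breaks
and page layout, so it is only an approximation.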


Rich Foley
University of Lapland


