6.1177, Sum: Phonemicity of writing

The Linguist List linguist at tam2000.tamu.edu
Tue Aug 29 21:11:37 UTC 1995


---------------------------------------------------------------------------
LINGUIST List:  Vol-6-1177. Tue Aug 29 1995. ISSN: 1068-4875. Lines:  299
 
Subject: 6.1177, Sum: Phonemicity of writing
 
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
 
Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>
 
Software development: John H. Remmers <remmers at emunix.emich.edu>
 
Editor for this issue: dizdar at tam2000.tamu.edu (Ann Dizdar)
 
---------------------------------Directory-----------------------------------
1)
Date:  Mon, 28 Aug 1995 23:17:34 EDT
From:  rws at research.att.com (Richard Sproat)
Subject:  Phonemicity of writing: Summary
 
---------------------------------Messages------------------------------------
1)
Date:  Mon, 28 Aug 1995 23:17:34 EDT
From:  rws at research.att.com (Richard Sproat)
Subject:  Phonemicity of writing: Summary
 
 
On August 13 I posted a query concerning the predictability of word
pronunciation from orthography. Specifically: in how many languages
can one predict the phonemic representation of most or all words ---
including suprasegmental information such as tone, accent or stress
--- from the written forms of those words? I received a large number
of responses, which I summarize below. Before I do that, however, I
want to clear up a couple of terminological issues and answer one
question that several people asked:
 
1) THE TERM `PHONEMIC'.  As Wayles Browne correctly pointed out, my
use of the term `phonemic' here is somewhat idiosyncratic. The term
phonemic is usually used to denote a system where there is an exact
one-to-one correspondence between phonemes (again, including
phonemically distinctive suprasegmental features) and their graphemic
representation; as unequivocal an example of this as one could find would
be a linguist's phonemic transcription. My query concerned only half
of that correspondence, namely the predictability in the
grapheme-to-phoneme direction. Thus Spanish orthography would not be
phonemic in the strictest sense since the symbol <h> corresponds to no
phoneme; similarly <b> and <v> are not phonemically distinct; however
the grapheme-to-phoneme correspondence in Spanish is pretty
predictable --- modulo some problems with the representation of glides
which Jose Ignacio Hualde reminded me of.
 
Suffice it to say that in all that follows, by `phonemic' I intend
this particular idiosyncratic sense.
 
2) THE TERM `GRAPHEME'. As Peter Daniels pointed out in a posted
response of August 14, while I was somewhat cautious in my use of the
term `phoneme', I took it for granted that the term `grapheme' has a
clear meaning. Naturally, I was aware that this is an equally
contentious term: even a standard dictionary gives more than one
definition. The sense in which I meant it, of course, was as a basic
symbol of an `alphabet', but this is an equally ill-defined concept,
since what counts as a basic symbol? In Chinese, for example, common
wisdom as well as traditional Sinological thought would say that the
character (hanzi) is the basic unit, but of course it is
well-understood that most Chinese characters have meaningful internal
structure, with the vast majority being constructed out of a
`semantic' radical and a `phonetic' component: it would be perfectly
reasonable to consider these sub-character components to be graphemes.
However, while the linguistic definition of grapheme is perhaps hard
to decide on, the internationalization of computer technology has at
least provided one consistent --- if often arbitrary --- definition,
and one which for my purposes (see (3) below) is perfectly
satisfactory: a grapheme is a basic unit of writing for which there is
a defined code in some accepted electronic coding scheme such as
ASCII, ISO-8859, BIG5, JIS or UNICODE. Thus single letters of English,
letters with or without accents in Spanish, katakana or hiragana
symbols in Japanese, hangul in Korean, or single Chinese characters in
Chinese, Japanese or Korean; all of these would count as graphemes for
my purposes since they have defined electronic codes in the
appropriate coding schemes.
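 
As a rough illustration of this working definition, the following
Python sketch treats each coded character of a string as one grapheme.
It assumes Unicode as the coding scheme and first normalizes to NFC, so
that a letter plus a combining accent counts as the single precomposed
code wherever one exists. The function name coded_graphemes is
hypothetical, and the choice of NFC is an assumption rather than
something the definition above dictates.

```python
import unicodedata


def coded_graphemes(text):
    """List (character, code point, Unicode name) for each coded character.

    Assumption: Unicode is the coding scheme, and the text is normalized
    to NFC so that precomposed accented letters count as single units.
    """
    normalized = unicodedata.normalize("NFC", text)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in normalized
    ]


# English letters, Spanish n + combining tilde, a katakana symbol,
# a hangul syllable and a Chinese character.
for sample in ["cat", "an\u0303o", "\u30ab", "\ud55c", "\u5b57"]:
    for _, code, name in coded_graphemes(sample):
        print(code, name)
```

Under this assumption a hangul syllable block counts as one unit,
because Unicode assigns codes to precomposed syllables; a scheme that
coded only the individual jamo would give a different count.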
 
3) WHY AM I INTERESTED IN THIS? I work on text-to-speech conversion,
specifically on the portion of text-to-speech systems that converts
written text into linguistic representations, which include, among
other things, transcriptions of words into phonemic representation. It
is therefore of some practical concern how many languages are `easy'
from the point of view of grapheme-to-phoneme conversion, and how many
are `hard' in the sense that they would require a lot of lexical
information in order to do a good job.  (More formally, as Allan
Wechsler suggested, one can couch the problem in terms of Kolmogorov
complexity: for a given language, what is the size of the Turing
machine required to convert from the orthographic form to the phonemic
form?)  Apart from that (though certainly motivated by that)
I am interested in general in how writing represents language.
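 
To make the `easy' versus `hard' distinction concrete, here is a
minimal Python sketch of a rule-plus-lexicon converter. Everything in
it is an assumption for illustration: the three rules are a toy
fragment loosely based on the Spanish facts mentioned above (silent
<h>, <v> pronounced like <b>, plus the digraph <ch>), the phoneme
symbols are ASCII stand-ins, and description_size is only a crude
proxy for the Kolmogorov-style measure, not a computation of it.

```python
# Toy rule table: grapheme sequence -> phoneme string (ASCII stand-ins).
# This is an illustrative fragment, not a complete rule set for Spanish.
RULES = {"ch": "tS", "h": "", "v": "b"}


def g2p(word, rules=RULES, exceptions=None):
    """Convert a written word to a phoneme string.

    Words listed in the (hypothetical) exception lexicon are looked up
    directly; all others are rewritten left to right, longest rule
    first, and letters without a rule are passed through unchanged.
    """
    exceptions = exceptions or {}
    if word in exceptions:
        return exceptions[word]
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


def description_size(rules, exceptions):
    """Crude proxy for conversion cost: rule count plus lexicon size."""
    return len(rules) + len(exceptions)


print(g2p("hola"), g2p("vino"), g2p("noche"))
```

On this view an `easy' orthography is one where the exception lexicon
stays small relative to the rule table, and a `hard' one is where most
of the work has to be done by lexical lookup.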
 
------------------------------------------------------------------------
 
SUMMARY:
 
The vast majority of languages mentioned in the responses were
`phonemic' in the sense intended --- with one important caveat:
stress, accent and tone are typically NOT represented in the
orthography, even when they are phonemically distinctive. Of course
this underrepresentation of suprasegmental information can have
segmental import since, for example, stress placement often interacts
with glide formation, or (as in Russian) can have serious effects on
vowel quality.
 
The following are the languages, or groups of languages, for which I
received responses, each with a quick summary of its `phonemicity' and
notes on particular problems. (It should be clear that the following
descriptions are my own distillation of what the various respondents
sent me.)
 
CHEYENNE: Phonemic: developed in the early 1970s. (Dan Alford)
 
RUSSIAN: Fairly phonemic, but the lack of stress marks and the failure
to write e-with-two-dots (= /jo/, or /o/ with a preceding palatal
consonant) in most texts are problems. (Wayles Browne) I can also add
that there are a large number of partly-assimilated foreign words in
which orthographic vowels that normally represent members of the
palatalizing series represent instead members of the non-palatalizing
series. For example, <e>, which normally represents /je/, or /e/ with
a preceding palatal consonant, represents instead /e/ with a preceding
non-palatal consonant; examples include `model' and `test'. There are
quite a number of other subtle irregularities.
 
BELARUSIAN: Fairly phonemic (more so than Russian), but the lack of
stress-marks is still a problem. (Wayles Browne)
 
UKRAINIAN: Fairly phonemic (more so than Russian), but the lack of
stress-marks is still a problem. (Wayles Browne)
 
POLISH: Phonemic (except for problems with loanwords). (Wayles Browne,
Peter Paul)
 
CZECH: Almost completely phonemic. (Wayles Browne, James Kirchner)
 
SLOVAK: Basically phonemic. (Wayles Browne)
 
SLOVENIAN: Fairly phonemic, but stress is not marked, and neither is
schwa, nor the distinction between open/closed /e/ and open/closed
/o/. (Wayles Browne)
 
SERBO-CROATIAN: Highly phonemic, except that accent position and type
are not marked, and neither is vowel length. (Wayles Browne, Allan
Wechsler)
 
MACEDONIAN: Phonemic. (Wayles Browne)
 
BULGARIAN: Fairly phonemic (more so than Russian), but the lack of
stress-marks is still a problem. (Wayles Browne)
 
GHANAIAN LANGUAGES: Of the 60-odd languages spoken in Ghana, about
half have established writing systems. Of these, many are phonemic at
the segmental level, but almost none of them represent tone, even
though this is distinctive. Some of them (e.g. Chumburung) do mark
tone in restricted circumstances, e.g. to distinguish otherwise
homophonous morphemes. An additional complication is that many of
these languages have 9-vowel systems but usually only seven vowel
symbols are used: exceptions are Nawuri and Gidire (Adele), which
represent all nine. Nine-vowel systems all have ATR harmony and in
many cases the vowel quality is thus predictable from the orthographic
representation (via the ATR harmony constraints), though not
always. The underrepresentation of both tone and vowels is probably
due to the precedent set by the orthographies of the more widely
spoken southern Ghanaian languages such as Akan, Ewe and Ga. (Note
that Akan actually has nine vowels, but only seven vowel symbols.) A
related and more fundamentally practical reason is that most
typewriters used in Ghana, while having the symbols epsilon and
backwards-c (used in Akan), lacked other special symbols. One final
point is that the information-load carried by tone is relatively low
in many of these languages: there are relatively few minimal pairs
that are distinguished solely by tone.  (Rod Casali)
 
URDU has a larger number of vowels than can be encoded with the Arabic
script, and hence several phonemes are left unwritten. (John Coleman)
 
BAMBARA and DOGON have phonemic segmental representations, but like
other West African languages, do not represent tone. (Chris Culy)
 
THAI is phonemic in the direction of interest, though the rules for
predicting tone from orthography are exceedingly complex. (Victor
Gaultney, John Kingston) (Note, though, that Thai, like Chinese and
Japanese, does not mark word boundaries in writing, which tends to
make text processing more complex.)
 
HUNGARIAN: Phonemic, except that written <e> is used for two different
phonemes which educated speakers tend to distinguish. The graphemic
sequence <szs> is a potential problem since it may be parsed as <sz.s>
/sS/ (S = /sh/ as in `ship') or as <s.zs> /S3/ (3 = /zh/ as in
`leisure'), but the latter is rare, and morphological information can
disambiguate the cases. The word for `one' (and its derivatives) ---
`egy' (where <gy> = /barred-j/) --- is (irregularly) pronounced with a
geminate /barred-j/, which would regularly be written <ggy>.
(halasz at kewszeg.norden1.com, Peter Szigetvari)
 
GREEK: Nearly completely phonemic. (Stavros Macrakis)
 
TURKISH: Almost completely phonemic. Note that modern Turkish
orthography, based on the Roman alphabet, dates only to the
1920's. (Stavros Macrakis, Inci Ozkaragoz, Steve Seegmiller)
 
FINNISH is completely phonemic. (Victor Gaultney, Deborah Ruuskanen)
 
KOREAN (as written in hangul) is almost completely phonemic, though
vowel length (the modern manifestation of pitch accent in some
dialects) is not marked.  For dialects other than Seoul Korean where
pitch accents still exist, one would need lexical information to
predict the pitch accent since this is not marked in the orthography.
(Bart Mathias, David Silva)
 
SIGNED LANGUAGES: There are various systems under development for the
graphic representation of signed languages, such as SignWriter
(promoted by the Deaf Action Committee of Southern California), which
are fairly phonemic in the representation of signs. (Cindy
Neuroth-Gimbrone)
 
AGUARUNA (JIVAROAN) (Peru): Phonemic, except that pitch accent
placement is not marked and is not always predictable (about 1/4 of
nouns have irregular accent placement), and a phonemic nasal/oral
contrast in vowels is not orthographically marked. (David Payne)
 
ASHENINKA (CAMPA). Entirely phonemic. (David Payne)
 
CATALAN.  Less phonemic than Spanish: open/closed distinction for /e/
and /o/ is not marked in the orthography. (Pilar Prieto)
 
HAWAIIAN. Phonemic. (Deborah Ruuskanen)
 
TURKIC LANGUAGES OF FORMER USSR (Azerbaijani, Turkmen, Uzbek, Kazakh,
Kyrgyz, Karachay ...) Basically phonemic, but there are some
complications where Cyrillic letters are used to represent both
Russian sounds and strictly Turkic sounds: the Cyrillic symbol for
"ju" is used to represent IPA /y/ (high front rounded vowel), but may
also be used to represent the sequence /ju/, as in Russian. Stress is
not marked and there are some exceptions to normal stress patterns.
In some cases, even non-phonemic distinctions are marked: in Karachay
the phoneme /g/ has fricative allophones, such as a voiced velar
fricative between back vowels, and these fricatives are written with
separate letters. (Clearly for my purposes this overspecification is
not a problem, and is even helpful.) (Steve Seegmiller)
 
WEST(ERLAUWER) FRISIAN: The orthography underwent a number of changes
in the 1980s which tend to make it more phonemic. However, diphthongs
remain a problem, since falling diphthongs `break' into rising
diphthongs in certain morphological contexts and this is not reflected
in the orthography; there are too many irregularities to predict the
breaking purely by rule. (Henk Wolf)
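 
The Hungarian <szs> case above is an instance of a general difficulty:
when an orthography uses digraphs, a letter string may divide into
graphemes in more than one way. A minimal Python sketch of enumerating
the candidate parses follows. The inventory is a deliberately partial,
illustrative subset (only <s>, <sz> and <zs>), and the function name is
hypothetical; choosing between the parses would still need the
morphological information mentioned in the Hungarian entry.

```python
def segmentations(letters, inventory):
    """Return every way of splitting `letters` into units of `inventory`."""
    if not letters:
        return [[]]  # exactly one way to split the empty string
    parses = []
    for unit in sorted(inventory):
        if letters.startswith(unit):
            # Keep this unit and recurse on whatever follows it.
            for rest in segmentations(letters[len(unit):], inventory):
                parses.append([unit] + rest)
    return parses


# Illustrative partial inventory, not the full Hungarian alphabet.
INVENTORY = {"s", "sz", "zs"}

for parse in segmentations("szs", INVENTORY):
    print(".".join(parse))
```

With this inventory the sketch yields the two parses discussed in the
Hungarian entry, <s.zs> and <sz.s>.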
 
------------------------------------------------------------------------
 
ACKNOWLEDGMENTS: I would like to thank the following people for
sending me often quite detailed responses:
 
Dan Alford                      dalford at s1.csuhayward.edu
Wayles Browne                   ewb2 at cornell.edu
Andrew Carstairs-McCarthy       a.c-mcc at ling.canterbury.ac.nz
Rod Casali                      IZZYPF9 at MVS.OAC.UCLA.EDU
John Coleman                    John.Coleman at Phonetics.Oxford.ac.uk
Chris Culy                      chris-culy at uiowa.edu
Victor Gaultney                 victor.gaultney at sil.org
--                              halasz at kewszeg.norden1.com
Jose Ignacio Hualde             jihualde at ux1.cso.uiuc.edu
John Kingston                   KINGSTON at coins.cs.umass.edu
James Kirchner                  JPKIRCHNER at aol.com
Duncan MacGregor                aa735 at freenet.carleton.ca
Stavros Macrakis                macrakis at osf.org
Bart Mathias                    mathias at hawaii.edu
Cindy Neuroth-Gimbrone          cng9 at vivanet.com
Inci Ozkaragoz                  IOZKARAGOZ at firstbyte.davd.com
Peter Paul                      Peter.Paul at arts.monash.edu.au
David Payne                     dpayne at gower.net
Pilar Prieto                    prieto at research.att.com
Deborah Ruuskanen               druuskan at cc.helsinki.fi
Steve Seegmiller                SEEGMILLER at apollo.montclair.edu
David Silva                     david at utafll.uta.edu
Peter Szigetvari                szigetva at osiris.elte.hu
Allan Wechsler                  Wechsler at world.std.com
Henk Wolf                       H.A.Y.Wolf at stud.let.ruu.nl
 
Thanks also to Peter Daniels for the pointer to the upcoming "World's
Writing Systems", which I expect will put these kinds of issues in a
more systematic light than heretofore.
 
--
 
Richard Sproat
Linguistics Research Department
AT&T Bell Laboratories                  | tel (908) 582-5296
600 Mountain Avenue, Room 2d-451        | fax (908) 582-7308
Murray Hill, NJ 07974, USA              | rws at research.att.com
 
------------------------------------------------------------------------
LINGUIST List: Vol-6-1177.