Corpora: Morphology and Word Length (was: Relative text length)
    Mike Maxwell 
    maxwell at ldc.upenn.edu
       
    Fri Apr 26 13:36:55 UTC 2002
    
    
  
Damlon Davison writes:
>It may be obvious, but agglutinating languages
>tend to have longer words
--or at least the _average_ length of words in agglutinating languages tends
to be longer, which presumably is what is meant here.  Languages like
English that have substantial derivational morphology can have some long
words, but a glance at a text in an agglutinating language like Quechua will
show the difference in average length.
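To make "average length" concrete, here is a minimal sketch of how one might
measure it on a sample text.  The tokenization rule (a word is any unbroken
run of letters) and the function name are assumptions made for illustration;
they are not from the thread, and a real comparison would need a tokenizer
suited to each language's orthography.

```python
import re

def mean_word_length(text: str) -> float:
    """Mean length, in characters, of the orthographic words in `text`.

    Assumption: a "word" is a maximal run of Unicode letters, so digits,
    punctuation, apostrophes and hyphens all act as separators.
    """
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

Running this over texts of comparable genre in two languages gives the kind
of rough comparison meant here; it says nothing about why the averages
differ.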
I suspect polysynthetic languages also have long words, but whether that's
true on the average, or only of some words (verbs with incorporated nouns,
say), I don't know.  I've never looked at an extended text in such a
language.  And of course compounding can create long words (look at a German
text), as perhaps can reduplication in languages that use whole-word
reduplication.
I suspect that another influence on word length is the phonology: languages
with large phoneme inventories tend to have shorter words.  Does anyone have
data on this?  E.g. languages with large numbers of consonants (the Caucasus
region?), or languages with lots of tones (some Chinese languages--in
Romanized scripts, of course!--or the Chinantec languages of Mexico), as
opposed to languages like Hawai'ian, which is notorious for a small phoneme
inventory (around 13, as I recall) and long words.
Since there are at least two factors related to word length (morphology and
phonology), and several different factors within morphology, I wonder
whether anyone has experimented with automatic classification of
morphological type.  We're having a workshop at the ACL this summer on
morphology learning, but it ought to be possible to get a rough idea of how
many affixes there are without learning the "entire" morphology.  Perhaps just
seeing how compressible a text is would give you some idea, or turning it
into a minimized FSA.
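The compressibility idea could be tried with something as crude as the
following sketch.  It uses zlib from the Python standard library as the
compressor; that choice, the function name, and the use of UTF-8 bytes are
all assumptions for illustration.  Whether the resulting number actually
tracks morphological type is exactly the open question, so treat it as an
experiment to run, not a result.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size, both measured in UTF-8 bytes.

    Lower values mean the text is more compressible.  Empty input is
    reported as 1.0 (no compression) to avoid dividing by zero.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    # Level 9 asks zlib for its strongest (slowest) compression setting.
    return len(zlib.compress(raw, 9)) / len(raw)
```

The ratio depends on the length of the sample and on how many bytes the
script takes in UTF-8, so any comparison across languages should use samples
of similar size and, ideally, a comparable transliteration.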
Finally, there is a big caveat: the length of a word depends very much on
orthographic decisions.  Are clitics written solid?  Compounds?
Written German has long 'words' because the compound nouns are written
solid.  If they were written with a space between the nouns, the words
would become a lot shorter--not to mention how much easier it would be to
read.  I guess the original observation on this is by Mark Twain :-).
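A toy illustration of how much the spacing convention alone can move the
numbers.  The spaced-out spelling below is hypothetical (it is not standard
German orthography), and the whitespace tokenizer is an assumption made to
keep the sketch short.

```python
def mean_token_length(text: str) -> float:
    """Mean length in characters of whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)

# The same two-word phrase, with the compound written solid and then
# (hypothetically) with a space between its component nouns.
solid = "die Dampfschiffgesellschaft"
spaced = "die Dampf Schiff Gesellschaft"

print(mean_token_length(solid))   # 13.0
print(mean_token_length(spaced))  # 6.5
```

Nothing about the language has changed between the two lines; only the
decision about where to put spaces.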
I have even heard of a language where the linguist who designed the
orthography decided to write a space between each morpheme, turning an
agglutinating language into an isolating language in the orthography!  (One
wonders how the written language will look after a generation or two.)
     Mike Maxwell
     Linguistic Data Consortium
     maxwell at ldc.upenn.edu
    
    