an elegant solution

Wed Aug 25 19:16:48 UTC 1999

Dear Info-CHILDES,
  Mary MacWhinney just now pointed out to me an extremely elegant and simple
solution to the problem of computing MLU on raw CHAT files.  Currently, there
are three recommended approaches to calculating MLU.

1.  The first approach uses the earlier method devised by Miller and Chapman
in SALT.  It relies on "main line morphemicization" of words.  In this
method, "shoes" becomes "shoe-s" and "John's" becomes "John-'s".  As many of
you have learned, this method has many limitations.  It is difficult to know
how to segment "lent", "gonna", or "can't" and the method becomes even more
problematic for languages other than English.

2.  To solve some of these problems, we introduced a second method of main
line morphemicization using replacement symbols.  In this method, you can
have "gonna [: go-ing to]" and then you can perform one count that is
analytic and one that is non-analytic.  This is more consistent, but it is a
lot of work.

3.  Finally, you can construct a complete %mor line for the file.  This is
the best solution, but also requires the most work.

Mary's "new" solution is the following.  You first run MLU on the file or the
collection of files to get the number of utterances.  Then you run FREQ on the
same file or files to get the complete frequency listing.  You take the summed
frequency as a starting total, which counts one morpheme per word, and then go
through word by word and decide whether each word has one, two, or three
morphemes.  If it has two, you add the count for that word to the grand total
once more.  If it has three, you double the count for that word and add it to
the grand total.  And so on for all the words that have more than one
morpheme.  Dividing the final total by the number of utterances gives the MLU.
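
The arithmetic of this procedure can be sketched in a few lines of Python.
This is only an illustration, not part of CLAN: the function name, the
dictionaries, and the counts below are all hypothetical, and it assumes you
have already copied the word frequencies and the utterance count out of the
FREQ and MLU output by hand.

```python
def mlu_from_frequencies(word_freq, morphemes_per_word, n_utterances):
    """Estimate MLU in morphemes from a word frequency listing.

    word_freq          -- word -> number of tokens (hypothetical, typed in by hand)
    morphemes_per_word -- word -> morpheme count, only for multi-morpheme words
    n_utterances       -- number of utterances reported for the same files
    """
    if n_utterances <= 0:
        raise ValueError("need at least one utterance")

    # Start from the summed frequency: one morpheme for every word token.
    total = sum(word_freq.values())

    # For each word with more than one morpheme, add the extra morphemes:
    # once the count for two morphemes, twice the count for three, and so on.
    for word, count in word_freq.items():
        extra = morphemes_per_word.get(word, 1) - 1
        total += extra * count

    return total / n_utterances


# Made-up counts, for illustration only.
freq = {"shoes": 4, "John's": 2, "go": 10, "the": 14}
morphs = {"shoes": 2, "John's": 2}  # words not listed count as one morpheme
mlu = mlu_from_frequencies(freq, morphs, n_utterances=10)
print(mlu)
```

Only the words judged to have more than one morpheme need to be listed, which
is where the saving in effort comes from: each word type is judged once, no
matter how many times it occurs.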

When I mentioned this to a colleague, she said that she had already done
something like this on her own.  So, perhaps, I am the only person left in
the child language world who has not already figured this one out.  But
perhaps not.

This may sound like a tedious method, but actually it is not that bad and it
can be significantly more efficient than the three alternatives, at least for
most Indo-European languages.  Think about it.

--Brian MacWhinney