Fixed

Brian MacWhinney macw at cmu.edu
Thu Sep 18 02:10:34 UTC 2003


Nan and Info-CHIBolts,

  Nan Bernstein-Ratner sent me a note a few days ago complaining about how
CHECK is no longer allowing main line morphemicization.  I think we should
have a full public discussion of this important issue, so here is my answer
to Nan:

  I very much understand this problem and it has been a very difficult
decision to take.  I think that individual researchers have a different
perspective on this from the one I have derived over the last year as I have
really gone through every single file in the database.  In doing so, I
have found an embarrassing divergence of morphemicization practices.  The
divergences involve unreliable distinctions between analytic and
non-analytic affixes, decisions to separate off some affixes and not others,
inconsistent compound detection, misunderstanding of some conventions, and
errors in marker placement.  Although a few researchers did a good job in
affix, clitic, and compound marking, the overall effect on the database has
been far too inconsistent.  As a result, I think it is crucial to move
toward reliance on MOR and POST in order to maintain the scientific quality
of our work.  I am not saying that MOR is perfect, but it is completely
predictable and the morphemicization decisions it makes are ones that are
fully open to public discussion.  If we decide that something is wrong in
MOR, we can just change it and our work will then be directly improved.

  At the same time, I recognize several points.  First, the earlier method
of main line morphemicization has pedagogical utility, as long as data coded
in this way are not destined to be added to the database.  Second, I realize
that I have not
yet constructed a full account in the CLAN manual of how to compute MLU and
such from the %mor line.  Third, there are several affixing languages for
which no MOR grammar is yet available and those languages will need the
older method for some time yet.  Until I have completed all of this work,
it makes sense to continue with the older approach for pedagogical
purposes.  If you want to use the older approach, here is how.  First,
in the newly required @Languages line, which follows right after the @Begin
line, put this:

@Languages: en, legacy

The "en" is the universal ISO code for English.  The term "legacy" means
specifically that old style morpheme delimiters are allowed by CHECK.

Second, in the depfile, add *-*

Then, believe it or not, CHECK will be happy and everything will work as
before.  But please explain to your students that they will eventually need
to learn to run programs like MLU from the %mor line.
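
For anyone preparing many teaching files, the @Languages step above can be
checked with a small script.  This is only a rough sketch and is not part of
CLAN or CHECK: the function name has_legacy_languages_line is made up for
illustration, and it assumes only what is described above, namely that the
@Languages line comes right after @Begin and lists "legacy" among its
comma-separated entries.

```python
def has_legacy_languages_line(chat_text: str) -> bool:
    """Return True if the line right after @Begin is an @Languages
    line whose comma-separated entries include 'legacy'.

    Hypothetical helper for illustration; it does not replace CHECK.
    """
    # Drop blank lines and surrounding whitespace (headers may use tabs).
    lines = [ln.strip() for ln in chat_text.splitlines() if ln.strip()]
    for i, ln in enumerate(lines[:-1]):
        if ln == "@Begin":
            nxt = lines[i + 1]
            if not nxt.startswith("@Languages:"):
                return False
            # Everything after the first colon is the list of entries.
            codes = [c.strip() for c in nxt.split(":", 1)[1].split(",")]
            return "legacy" in codes
    return False


sample = "@Begin\n@Languages: en, legacy\n@End\n"
print(has_legacy_languages_line(sample))
```

The script looks only at the header; the *-* entry in the depfile still has
to be added by hand as described above.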

Also, please note that there are morphemicized versions of the English
corpora on the server that can be reliably used to compute MLU and other
features from the %mor line.

Please comment on these issues and tell me if you run into any snags.

--Brian MacWhinney


