Fixed
Leach, Diane (NIH/NICHD)
leachd at mail.nih.gov
Tue Sep 23 18:23:37 UTC 2003
Hi Brian et al.
I am very new to CHAT and CLAN, so I apologize in advance if my ignorance
shows. We have been using main line morphemicization for the purpose of
computing MLU in morphemes in one set of data. Furthermore, I have been
exploring the possibility of converting SALT data to CHAT so that I could
analyze similar data in other languages like Dutch, Italian, Japanese,
French, Hebrew, etc. When I use the SALTIN procedure, it converts my SALT
transcripts into main line morphemicized CHAT data. This is all fine and
dandy except for one thing. When I try to get a count of the number of
different word roots spoken (using freq), CLAN is counting different forms
of the same word as different word types. See below:
4 bear
2 bear-s
1 because
4 can
3 can-'nt
1 cause
1 chair-s
1 children
1 clothe-s
1 come-ing
1 could
1 daddy
2 have
1 have-ing
1 he
1 he-'is
I tried adding *-* to the depfile and adding 'legacy' to the @Languages
line, but I'm still having the problem. Am I doing something wrong? Is
there an option to ignore the endings?
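
One possible workaround, offered only as a sketch and not as a CLAN feature, is to post-process the freq output outside CLAN. The Python below assumes each output line has the form "count form" and that everything before the first hyphen is the root. The function name collapse_to_roots is hypothetical, and the sample data is the listing above.

```python
from collections import Counter

# The freq listing quoted above, one "count form" pair per line.
FREQ_OUTPUT = """\
4 bear
2 bear-s
1 because
4 can
3 can-'nt
1 cause
1 chair-s
1 children
1 clothe-s
1 come-ing
1 could
1 daddy
2 have
1 have-ing
1 he
1 he-'is
"""


def collapse_to_roots(freq_text):
    """Sum token counts per root, treating text before the first '-' as the root.

    Assumption: suffixes and clitics are always set off with a hyphen,
    as in the main line morphemicized forms shown above.
    """
    roots = Counter()
    for line in freq_text.splitlines():
        parts = line.split()
        # Skip anything that is not a "count form" pair.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, form = int(parts[0]), parts[1]
        root = form.split("-", 1)[0]
        roots[root] += count
    return roots


roots = collapse_to_roots(FREQ_OUTPUT)
for root in sorted(roots):
    print(roots[root], root)
print("root types:", len(roots))
```

This only merges forms that share a hyphen-delimited stem. Irregular forms such as "children" or "could" would still be counted as separate types.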
Thanks!
Diane
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
Diane B. Leach, Ph.D.
Statistician
Child and Family Research, NICHD, NIH
6705 Rockledge Drive, Suite 8030
Bethesda, MD 20892
301-496-6291 phone
301-496-2766 fax
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
-----Original Message-----
From: Brian MacWhinney [mailto:macw at cmu.edu]
Sent: Wednesday, September 17, 2003 10:11 PM
To: Nan Ratner; info-chibolts at mail.talkbank.org
Subject: Re: Fixed
Nan and Info-CHIBolts,
Nan Bernstein-Ratner sent me a note a few days ago complaining about how
CHECK is no longer allowing main line morphemicization. I think we should have a
full public discussion of this important issue, so here is my answer to Nan:
I very much understand this problem and it has been a very difficult
decision to take. I think that individual researchers have a different
perspective on this from the one I have derived over the last year as I have
really gone through every single file in the database. In doing so, I
have found an embarrassing divergence of morphemicization practices. The
divergences involve unreliable distinctions between analytic and
non-analytic affixes, decisions to separate off some affixes and not others,
inconsistent compound detection, misunderstanding of some conventions, and
errors in marker placement. Although a few researchers did a good job in
affix, clitic, and compound marking, the overall effect on the database has
been far too inconsistent. As a result, I think it is crucial to move
toward reliance on MOR and POST in order to maintain the scientific quality
of our work. I am not saying that MOR is perfect, but it is completely
predictable and the morphemicization decisions it makes are ones that are
fully open to public discussion. If we decide that something is wrong in
MOR, we can just change it and our work will then be directly improved.
At the same time, I recognize the pedagogical utility of the earlier
method of main line morphemicization, as long as data coded in this way are
not destined to be added to the database. Second, I realize that I have not
yet constructed a full account in the CLAN manual of how to compute MLU and
such from the %mor line. Third, there are several affixing languages for
which no MOR grammar is yet available and those languages will need the
older method still for some time. So, until I have completed all of this
work, it makes sense to continue with the older approach for pedagogical
purposes. So, if you want to use the older approach, here is how. First,
in the newly required @Languages line, which follows right after the @Begin line,
put this:
@Languages: en, legacy
The "en" is the universal ISO code for English. The term "legacy" means
specifically that old style morpheme delimiters are allowed by CHECK.
Second, in the depfile, add *-*
Then, believe it or not, CHECK will be happy and everything will work as
before. But please explain to your students that they will eventually need
to learn to run programs like MLU from the %mor line.
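
As a quick sanity check before running CHECK, the header step described above could be verified with a few lines of Python. This is only a sketch and not part of CLAN: the function name allows_legacy_morphemes and the sample transcript are hypothetical, and the code assumes the @Languages header sits on a single line with comma-separated codes.

```python
def allows_legacy_morphemes(chat_text):
    """Return True if the @Languages header lists the 'legacy' keyword.

    Assumption: the header is on one line, e.g. "@Languages: en, legacy".
    """
    for line in chat_text.splitlines():
        if line.startswith("@Languages:"):
            codes = [code.strip() for code in line.split(":", 1)[1].split(",")]
            return "legacy" in codes
    # No @Languages line found at all.
    return False


# Hypothetical minimal transcript, for illustration only.
SAMPLE = (
    "@Begin\n"
    "@Languages:\ten, legacy\n"
    "@Participants:\tCHI Target_Child\n"
    "*CHI:\tbear-s .\n"
    "@End\n"
)

print(allows_legacy_morphemes(SAMPLE))
```

The check covers only the @Languages line. The *-* entry in the depfile still has to be added by hand.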
Also, please note that there are morphemicized versions of the English
corpora on the server that can be reliably used to compute MLU and other
features from the %mor line.
Please comment on these issues and tell me if you run into any snags.
--Brian MacWhinney