MOR grammar updates for Chinese and Italian
Brian MacWhinney
macw at cmu.edu
Sun Feb 18 11:16:50 UTC 2007
Dear Info-CHILDES,
I would like to report some recent changes in support for
morphological analysis for Mandarin, Cantonese, and Italian. The
basic goal here is to have complete and consistent automatic %mor
analysis for all languages in CHILDES.
For both Mandarin and Cantonese, we are dropping reliance on
romanizations in favor of reliance on Hanzi script, since Hanzi is
much less ambiguous than romanizations.
For Cantonese, this has meant revising the corpora to place the
Chinese characters on the main line. After that, I verified that all
of the words in the corpora were recognized by the Cantonese MOR.
The final step involves building a training corpus for automatic
disambiguation using POST. That step should be complete in the next
few weeks. Once that is done, I will add %mor lines to the Cantonese
corpora.
For Chinese (Mandarin), the work of building a training corpus was
done by Twila Tardif and her students, along with earlier help from
Chienju Chang. Using that training corpus, I have now succeeded in
creating an unambiguous %mor line for the Chang and Zhou corpora. I
have also placed the Beijing and Context corpora into a romanization
form that will allow us to eventually conduct a full automatic MOR
analysis.
For Italian, I revised the MOR grammar to provide full analysis of
all words in the Tonelli corpus. Then, Livia Tonelli and Maurizio
Fabris constructed a training corpus and we used the resultant POST
disambiguator to provide a full %mor line for the Tonelli corpus.
I would like to encourage people working with Cantonese, Mandarin,
and Italian to make use of these highly functional new tools. To
make sure that your encodings are in line with the grammars currently
available, you should run this command from time to time:
mor +xl *.cha
This creates a file called a minilex. The goal here is to have your
minilex file empty. If you run mor +xl and the resultant file is
empty, then you know that all of the forms you are entering in your
corpus are recognized by MOR and that part of speech analysis will be
fully automatic. If you find words in your minilex, then you can
either fix these words in the transcript or else add them to the MOR
grammar.
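If you want to script the "is my minilex empty?" check across many
corpora, something along these lines may help. This is only a rough
sketch and not part of CLAN: the function names are made up for
illustration, the path of the minilex file is whatever mor +xl wrote
on your system, and the one-entry-per-line layout is an assumption
about the file that you should confirm against your own output.

```python
from pathlib import Path


def unrecognized_forms(minilex_path):
    """Return the non-blank lines of a minilex file.

    Assumption: each non-blank line is one entry that MOR did not
    recognize. Check this against the file mor +xl actually produces.
    """
    text = Path(minilex_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def report(minilex_path):
    """Summarize whether the minilex is empty (hypothetical helper)."""
    forms = unrecognized_forms(minilex_path)
    if not forms:
        return "minilex is empty: all forms were recognized by MOR"
    return f"{len(forms)} entries still need fixing or adding to the grammar"
```

Call report() with the path of the file that mor +xl created. An empty
result from unrecognized_forms() corresponds to the situation described
above, where part of speech analysis can be fully automatic.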
This would also be a great time to contribute to CHILDES any corpora
you have available in Chinese or Italian.
Good luck with the use of these tools,
--Brian MacWhinney