MOR grammar updates for Chinese and Italian

Sun Feb 18 11:16:50 UTC 2007

Dear Info-CHILDES,

    I would like to report some recent changes in support for  
morphological analysis for Mandarin, Cantonese, and Italian.  The  
basic goal here is to have complete and consistent automatic %mor  
analysis for all languages in CHILDES.

For both Mandarin and Cantonese, we are dropping reliance on
romanizations in favor of reliance on Hanzi script, since Hanzi is
much less ambiguous than romanizations.

For Cantonese, this has meant revising the corpora to place the  
Chinese characters on the main line.  After that, I verified that all  
of the words in the corpora were recognized by the Cantonese MOR.   
The final step involves building a training corpus for automatic  
disambiguation using POST.  That step should be complete in the next  
few weeks.  Once that is done, I will add %mor lines to the Cantonese  
corpora.

For Chinese (Mandarin), the work of building a training corpus was  
done by Twila Tardif and her students, along with earlier help from  
Chienju Chang.  Using that training corpus, I have now succeeded in  
creating an unambiguous %mor line for the Chang and Zhou corpora.  I  
have also placed the Beijing and Context corpora into a romanization
form that will allow us to eventually conduct a full automatic MOR  
analysis.

For Italian, I revised the MOR grammar to provide full analysis of  
all words in the Tonelli corpus.  Then, Livia Tonelli and Maurizio  
Fabris constructed a training corpus and we used the resultant POST  
disambiguator to provide a full %mor line for the Tonelli corpus.

I would like to encourage people working with Cantonese, Mandarin,  
and Italian to make use of these highly functional new tools.  To  
make sure that your encodings are in line with the grammars currently  
available, you need to occasionally run this command:

mor +xl *.cha

This creates a file called a minilex, which lists any forms that MOR
does not recognize.  The goal is to have your minilex file empty.  If
you run mor +xl and the resultant file is
empty, then you know that all of the forms you are entering in your  
corpus are recognized by MOR and that part-of-speech analysis will be
fully automatic.  If you find words in your minilex, then you can  
either fix these words in the transcript or else add them to the MOR  
grammar.
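
If you run this check often, the "is the minilex empty?" test can be
scripted.  What follows is only a rough sketch, not part of CLAN: the
function name check_minilex is hypothetical, the name and location of
the file written by mor +xl are not assumed (you pass the path
yourself), and it assumes the minilex holds one entry per non-blank
line.

```shell
# Hypothetical helper, not part of CLAN.
# Argument: path to the minilex file that "mor +xl" produced on your system.
# Assumption: each unrecognized form occupies one non-blank line.
check_minilex() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 2
    fi
    # Count lines containing at least one non-whitespace character.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    n=$(grep -c '[^[:space:]]' "$1" || true)
    if [ "$n" -eq 0 ]; then
        echo "minilex is empty: all forms are recognized by MOR"
    else
        echo "minilex lists $n unrecognized form(s): fix the transcript or extend the MOR grammar"
    fi
}
```

After running mor +xl *.cha, you would call check_minilex with the
path of the resulting file.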

This would also be a great time to contribute to CHILDES any corpora  
you have available in Chinese or Italian.

Good luck with the use of these tools,

--Brian MacWhinney