New set of English tagged corpora
Brian MacWhinney
macw at cmu.edu
Tue Sep 23 22:19:13 UTC 2003
Dear Info-CHILDES,
I have now produced a new set of morphologically tagged and disambiguated
files for all of the English data, both from the USA and the UK. These
corpora can be reached from the CHILDES data page at
http://childes.psy.cmu.edu/data/
Note the two links there labeled "tagged" -- one for the USA and one for the
UK.
Over the last year we have worked to make sure that all of the words in
all of these corpora are recognized by MOR. This work is now done for all
of the corpora with the exception of Manchester, Hall, Sachs, and Snow.
Running MOR on these corpora took about 30 minutes, and running the POST
disambiguator afterwards took another 30-40 minutes. Of course, the
real work here was the work involved in making sure that every word was
recognized by MOR in the first place. Once that was done, the rest was
easy.
The accuracy of the tagging seems quite high. It appears to be better than
the 90% I calculated earlier, perhaps closer to 95%.
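For anyone who wants to check a figure like this on their own sample, here
is a minimal Python sketch of token-level accuracy, assuming you have a
hand-checked set of tags to compare against. The function name and the tag
labels are made up for illustration; this is not how the 90% or 95% figures
above were computed.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    # Count positions where the automatic tag agrees with the hand-checked one.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical tags for a five-word utterance; not taken from any corpus.
gold = ["pro", "v", "det", "n", "adv"]
predicted = ["pro", "v", "det", "v", "adv"]
print(tagging_accuracy(predicted, gold))
```

In this made-up sample one of the five tags is wrong, so the function
reports four out of five correct.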
Having now nearly finished tagging English, we will probably turn our
attention back to finishing the MOR tagging for Spanish. Other languages
that should receive attention soon include Mandarin, Cantonese, and Italian,
since we have the beginnings of grammars and lexicons for them. I need to
clarify the status of tagging for French, Dutch, and German.
Good luck with these data, and please send me any questions you have.
--Brian MacWhinney, CMU