tagged corpora
Brian MacWhinney
macw at cmu.edu
Sun Jul 7 23:01:59 UTC 2002
Dear Info-CHILDES,
In the context of some work I am doing, I ran the MOR program over the
entire normally-developing English CHILDES database and disambiguated the
resulting %mor line using POST. The resultant files are on the web now at
http://childes.psy.cmu.edu/english.sit. There is a link called
"tagged-English" on the home page that points to that file. It is 32 MB in
size and becomes 180 MB when expanded (that's a new record for compression
ratio, isn't it?), so be patient in downloading.
My estimate is that the MOR line in these files is about 90% accurate. We
have reached 95% accuracy for POST disambiguation in well-cleaned files.
Other files will surely have a lower level, but 90% is a reasonable guess.
This means that these data should only be used for analyses that are robust
against a certain level of tagging error.
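
As a rough illustration of what that error level implies, here is a minimal sketch in Python. It assumes the 90% figure applies per tagged word and that tagging errors are independent across words; neither assumption is established above, and the function names are hypothetical, chosen only for this example.

```python
# Back-of-the-envelope arithmetic for a given tagging accuracy.
# ASSUMPTIONS (not from the original message): accuracy is per word,
# and errors on different words are independent of each other.

def expected_tag_errors(n_words, accuracy=0.90):
    """Expected number of mistagged words among n_words."""
    return n_words * (1.0 - accuracy)

def prob_utterance_fully_correct(n_words, accuracy=0.90):
    """Probability that every word in an n_words utterance is tagged correctly."""
    return accuracy ** n_words

if __name__ == "__main__":
    # Compare the estimated overall level with the well-cleaned level.
    for acc in (0.90, 0.95):
        errors = expected_tag_errors(1000, acc)
        p5 = prob_utterance_fully_correct(5, acc)
        print(f"accuracy={acc:.2f}: about {errors:.0f} errors per 1000 words, "
              f"P(5-word utterance fully correct) = {p5:.2f}")
```

Under these assumptions, about 100 of every 1000 words carry a wrong tag at 90% accuracy, and only about 59% of five-word utterances are tagged correctly throughout. Analyses that depend on whole-utterance tag sequences are therefore more exposed to tagging error than analyses based on counts of individual word tags.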
--Brian MacWhinney