tagged corpora

Sun Jul 7 23:01:59 UTC 2002

Dear Info-CHILDES,
  In the context of some work I am doing, I ran the MOR program over the
entire normally-developing English CHILDES database and disambiguated the
resulting %mor line using POST.  The tagged files are now on the web at
http://childes.psy.cmu.edu/english.sit and there is a link called
"tagged-English" on the home page that points to that file.  It is 32 MB in
size and becomes 180 MB when expanded (that's a new record for compression
ratio, isn't it?), so be patient in downloading.
  My estimate is that the %mor line in these files is about 90% accurate.  We
have reached 95% accuracy for POST disambiguation in well-cleaned files.
Other files will surely score lower, so 90% overall is a reasonable guess.
This means that these data should only be used for analyses that are robust
against a certain level of tagging error.

--Brian MacWhinney