revised POST database
Brian MacWhinney
macw at cmu.edu
Mon Apr 17 01:33:44 UTC 2006
Dear Info-Chibolts,
I spent most of this week trying to tie up some loose ends in
the MOR analysis for English files. My first target was improving
the ability of the POST database to properly disambiguate between the
use of words like "have" and "is" and their corresponding contracted
forms as either auxiliaries or main verbs. Because of some errors I
had introduced into the Eve training corpus, these forms were being
overwhelmingly judged to be main verbs. After correcting the errors in
the training corpus, the bulk of these errors are now gone. I also
did some repairs to the training corpus for the disambiguation of the
word "to" as infinitive or preposition and for "like" as a verb,
preposition, or subordinating conjunction. And there were a variety
of additional minor fixes.
I then applied these fixes to the Manchester and Brown corpora
with good results. So far those are the only two corpora that have
been run through the new MOR and POST, but over the next few weeks we
will redo the rest of the database.
If you are actively using any datasets, you will want to get the
improved versions, or you can run the new MOR and POST on the
datasets yourself.
It is very helpful to receive feedback regarding systematic
errors in MOR and POST. Errors occurring in incomplete two-word
sentences are not too useful, but for longer sentences from either
the child or the adults, systematic reports of error types can help
guide me in future repairs to the training corpus or the expansion of
the training corpus.
--Brian MacWhinney