error coding
Brian MacWhinney
macw at cmu.edu
Mon Apr 17 02:11:40 UTC 2006
Dear Info-ChiBolts,
From 1999 to 2002, we reformatted the database to comply with
XML, while also shifting the responsibility for MLU and morphological
analysis from the main line to the %mor line. In the process of
making these changes, problems arose with the marking of
morphological errors in a few of the corpora, including particularly
the Brown and Manchester corpora. No data was lost, but the codes
were reconfigured in a way that made searching more difficult. In
the Brown corpus, the system for marking morphological errors relied
primarily on the special form marker @n as in goed@n. This coding
was basically a relic of CHAT from 1986. In the Manchester corpus,
the coding originally used the form go-ed [*] for overregularization
and run-0ed for missing markings. In order to remove the -0ed code,
I transformed run-0ed to run [* 0ed]. Unfortunately, these
transformations were not done consistently in either database. This
week, using a program that Leonid wrote for Manchester and some
systematic searches in BBEdit, I brought these two corpora into a new
systematic form.
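
As a rough illustration of the Manchester conversion described above,
here is a minimal Python sketch. It is not the program Leonid wrote.
It assumes the old code always has the shape stem-0suffix, and the name
convert_omission is hypothetical.

```python
import re

# Assumed old shape: a stem, then "-0", then the omitted suffix (e.g. run-0ed).
OMITTED_SUFFIX = re.compile(r"\b([A-Za-z']+)-0([A-Za-z]+)\b")


def convert_omission(line: str) -> str:
    """Rewrite stem-0suffix as stem [* 0suffix]; leave everything else alone."""
    return OMITTED_SUFFIX.sub(r"\1 [* 0\2]", line)


if __name__ == "__main__":
    print(convert_omission("*CHI:\tyesterday he run-0ed away ."))
```

A real conversion would also have to deal with inconsistently coded
lines, which this sketch does not attempt.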
The new form is documented in section 7.5 of the CHAT manual.
For the Brown corpus, the error marking uses categories like this:
Form     Function                         Error      Target
+ed      past overregularization          breaked    broke
+ed-sup  superfluous -ed                  broked     broke
+ed-dup  duplicated -ed                   breakeded  broke
virr     verb irregularization            bat        bit
+es      present overregularization       have       has
+est     superlative overmarking          mostest    most
+er      agentive overmarking             rubberer   rubber
+s       plural overregularization        childs     children
+s-sup   superfluous plural               childrens  children
+s-pos   plural for wrong part of speech  mines      mine
pos      general part of speech error     mine       my
sem      general semantic error
These codes appear on the main line in this shape:
breaked [: broke] [* +ed]
When MOR runs, it uses [: broke] to replace breaked. This means that
the %mor line does not reflect errors, but only the target
morphology. To search for these forms, you can use either of two
commands. If you want the error codes themselves, you can use:
freq +s"[\* *]" *.cha +u
If you want the material to which the error codes refer, you can use:
freq +s"<\* *>" *.cha +u
The only difference here is the use of the angle brackets in the
second case.
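
For anyone who wants to tally these codes outside of CLAN, here is a
minimal Python sketch. It does not reproduce what FREQ does. It assumes
only the main-line shape shown above, error [: target] [* code], with
the replacement being optional (as in run [* 0ed]). The name
count_errors and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed shape: word, optional " [: target]", then " [* code]".
ERROR = re.compile(r"(\S+)(?: \[: ([^\]]+)\])? \[\* ([^\]]*)\]")


def count_errors(lines):
    """Return (codes, words): counts of error codes and of (error, target) pairs."""
    codes, words = Counter(), Counter()
    for line in lines:
        if not line.startswith("*"):  # skip dependent tiers such as %mor
            continue
        for error, target, code in ERROR.findall(line):
            codes[code] += 1
            words[(error, target)] += 1  # target is "" when no replacement is given
    return codes, words


if __name__ == "__main__":
    sample = [
        "*CHI:\tI breaked [: broke] [* +ed] it .",
        "*CHI:\ttwo childs [: children] [* +s] breaked [: broke] [* +ed] .",
        "*CHI:\the run [* 0ed] away .",
    ]
    codes, words = count_errors(sample)
    print(codes)
    print(words)
```

The first counter corresponds loosely to searching for the codes
themselves, and the second to searching for the material they refer to.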
This coding has been done quite systematically in Brown. However,
for that corpus the original markings mostly emphasized overmarkings
and seldom indicated suffix omission. In the Manchester corpus, on
the other hand, there were consistent markings for both errors and
omissions. Now the omissions are quite clearly coded. However, the
errors have not yet been typologized as consistently as in the
Brown corpus.
Over time, we will extend this system to all the other corpora, both
for English and other languages. In general, it seems to me that
this new method for error marking is greatly superior to the earlier,
more haphazard approaches. The use of the %err line to code errors
provides good space for commentary, but is difficult to process
numerically. The use of [*] on the main line without further codes
is a good step, but adding the codes after the asterisk makes this
much more systematic and powerful. Finally, linking error coding to
the use of a replacement string, as in [: broke] for breaked, helps
remove the burden of error analysis from the MOR program.
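
To make the role of the replacement string concrete, here is a minimal
Python sketch of how a main line reduces to its target form once the
replacement is applied and the error code is set aside. This is only
an illustration under the assumption that errors are written as
error [: target] [* code]. It is not how MOR itself is implemented, and
the name target_form is hypothetical.

```python
import re

# Assumed shapes: "word [: target]" for replacements, " [* code]" for error codes.
REPLACEMENT = re.compile(r"\S+ \[: ([^\]]+)\]")
ERROR_CODE = re.compile(r" \[\*[^\]]*\]")


def target_form(utterance: str) -> str:
    """Drop error codes, then substitute each replacement for the erroneous word."""
    without_codes = ERROR_CODE.sub("", utterance)
    return REPLACEMENT.sub(r"\1", without_codes)


if __name__ == "__main__":
    print(target_form("I breaked [: broke] [* +ed] it ."))
```

The error information stays on the main line, where it can be searched,
while the string passed on for analysis contains only the target forms.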
I would like to encourage all research teams producing new corpora to
use this newer method. Many thanks.
--Brian MacWhinney