error coding

Mon Apr 17 02:11:40 UTC 2006

Dear Info-ChiBolts,

     From 1999 to 2002, we reformatted the database to comply with  
XML, while also shifting the responsibility for MLU and morphological  
analysis from the main line to the %mor line.  In the process of  
making these changes, problems arose with the marking of  
morphological errors in a few of the corpora, including particularly  
the Brown and Manchester corpora.  No data was lost, but the codes  
were reconfigured in a way that made searching more difficult.   In  
the Brown corpus, the system for marking morphological errors relied  
primarily on the special form marker @n as in goed@n.  This coding  
was basically a relic of CHAT from 1986.  In the Manchester corpus,  
the coding originally used the form go-ed [*] for overregularization  
and run-0ed for missing markings.  In order to remove the -0ed code,   
I transformed run-0ed to run [* 0ed].  Unfortunately, these  
transformations were not done consistently in either database.  This  
week, using a program that Leonid wrote for Manchester and some  
systematic searches in BBEdit, I brought these two corpora into a new  
systematic form.
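
As an illustration only (this is not the program Leonid wrote), the
run-0ed to run [* 0ed] transformation described above could be sketched
in Python roughly as follows.  The function name, the regular
expression, and the sample utterance are assumptions made for this
example.

```python
import re

# Hypothetical pattern: a word followed by a legacy zero-suffix code,
# e.g. "run-0ed".  Group 1 is the stem, group 2 the omitted suffix.
ZERO_SUFFIX = re.compile(r"\b([A-Za-z']+)-0([A-Za-z]+)\b")

def convert_zero_suffix(line: str) -> str:
    """Rewrite stem-0suffix as 'stem [* 0suffix]' everywhere in a line."""
    return ZERO_SUFFIX.sub(r"\1 [* 0\2]", line)

if __name__ == "__main__":
    sample = "*CHI:\tyesterday he run-0ed home ."  # invented sample line
    print(convert_zero_suffix(sample))
```

A real conversion would also need to check each corpus for variant
spellings of the old codes, which this sketch does not attempt.
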
    The new form is documented in section 7.5 of the CHAT manual.   
For the Brown corpus, the error marking uses categories like this:

Form      Function                          Example
+ed       past overregularization           breaked broke
+ed-sup   superfluous -ed                   broked broke
+ed-dup   duplicated -ed                    breakeded broke
virr      verb irregularization             bat bit
+es       present overregularization        have has
+est      superlative overmarking           most mostest
+er       agentive overmarking              rubber rubberer
+s        plural overregularization         childs children
+s-sup    superfluous plural                childrens children
+s-pos    plural for wrong part of speech   mines mine
pos       general part of speech error      mine my
sem       general semantic error

On the main line, these codes appear in this shape:

breaked [: broke] [* +ed]
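
For readers who post-process transcripts outside CLAN, the three parts
of this shape (the produced form, the replacement, and the error code)
can be pulled apart with a regular expression.  This is a minimal
Python sketch under the assumption that the coded item is a single
word; the names CodedError and find_coded_errors are invented for the
example.

```python
import re
from typing import List, NamedTuple

class CodedError(NamedTuple):
    error: str   # the form the speaker produced, e.g. "breaked"
    target: str  # the replacement inside [: ...], e.g. "broke"
    code: str    # the code inside [* ...], e.g. "+ed"

# One word, then "[: target]", then "[* code]".
CODED = re.compile(r"(\S+)\s+\[:\s+([^\]]+)\]\s+\[\*\s+([^\]]+)\]")

def find_coded_errors(main_line: str) -> List[CodedError]:
    """Return every word/replacement/code triple found on a main line."""
    return [
        CodedError(m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in CODED.finditer(main_line)
    ]

if __name__ == "__main__":
    line = "*CHI:\tI breaked [: broke] [* +ed] it ."  # invented sample line
    for item in find_coded_errors(line):
        print(item.error, item.target, item.code)
```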

When MOR runs, it uses [: broke] to replace breaked.  This means that  
the %mor line does not reflect errors, but only the target  
morphology.  There are two ways to search for these forms.  If you  
want the error codes themselves, you can use:

freq +s"[\* *]" *.cha +u

If you want the material to which the error codes refer, you can use:

freq +s"<\* *>" *.cha +u

The only difference here is the use of the angle brackets in the  
second case.
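
If the transcripts have to be tallied outside CLAN, the two kinds of
count can be approximated in a few lines of Python.  This is only a
rough analogue, not a reimplementation of freq: it assumes each code
applies to the single word before it, ignores multi-word scopes in
angle brackets, and skips a bare [*] that carries no code.  The function
name and sample lines are invented for the example.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# A word, an optional "[: replacement]", then "[* code]".
CODED = re.compile(r"(\S+)\s+(?:\[:\s+[^\]]+\]\s+)?\[\*\s+([^\]]+)\]")

def tally_errors(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count error codes and the coded word forms on main (*) lines."""
    codes: Counter = Counter()
    forms: Counter = Counter()
    for line in lines:
        if not line.startswith("*"):  # skip dependent tiers and headers
            continue
        for form, code in CODED.findall(line):
            forms[form] += 1
            codes[code.strip()] += 1
    return codes, forms

if __name__ == "__main__":
    sample = [  # invented sample lines
        "*CHI:\tI breaked [: broke] [* +ed] it .",
        "%com:\tdependent tier, ignored by the tally",
        "*CHI:\ttwo childs [: children] [* +s] .",
        "*CHI:\tit breaked [: broke] [* +ed] .",
    ]
    codes, forms = tally_errors(sample)
    print(codes.most_common())  # [('+ed', 2), ('+s', 1)]
    print(forms.most_common())  # [('breaked', 2), ('childs', 1)]
```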

This coding has been done quite systematically in Brown.  However,  
for that corpus the original markings mostly emphasized overmarkings  
and seldom indicated suffix omission.  In the Manchester corpus, on  
the other hand, there were consistent markings for both errors and  
omissions.  Now the omissions are quite clearly coded.  However, the  
errors have not yet been typologized quite as consistently as in the  
Brown corpus.

Over time, we will extend this system to all the other corpora, both  
for English and other languages.  In general, it seems to me that  
this new method for error marking is greatly superior to the earlier,  
more haphazard approaches.  The use of the %err line to code errors  
provides good space for commentary, but is difficult to process  
numerically.  The use of [*] on the main line without further codes  
is a good step, but adding the codes after the asterisk makes this  
much more systematic and powerful.  Finally, linking error coding to  
the use of a replacement string, as in [: broke] for breaked, helps  
remove the burden of error analysis from the MOR program.
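
To make the role of the replacement string concrete, here is a small
Python sketch of how a downstream script might derive the target-only
text that an analyzer would then see.  It is not how MOR itself is
implemented; the function name and patterns are assumptions, and only
single-word replacements are handled.

```python
import re

# "word [: target]" where target is the intended form.
REPLACED = re.compile(r"\S+\s+\[:\s+([^\]]+?)\s*\]")
# Any "[* ...]" error tag, with the whitespace before it.
ERROR_TAG = re.compile(r"\s*\[\*[^\]]*\]")

def target_form(utterance: str) -> str:
    """Drop error tags and substitute each replacement for its word."""
    without_tags = ERROR_TAG.sub("", utterance)
    return REPLACED.sub(r"\1", without_tags)

if __name__ == "__main__":
    print(target_form("I breaked [: broke] [* +ed] it ."))  # I broke it .
```

Because the error form never reaches the analyzer, the error typology
lives entirely in the [* ...] code on the main line.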

I would like to encourage all research teams producing new corpora to  
use this newer method.  Many thanks.

--Brian MacWhinney