[Corpora-List] "errors and the art of correcting"

Ramesh Krishnamurthy R.Krishnamurthy at aston.ac.uk
Sat Nov 11 16:01:28 UTC 2006


Hi Pete
I worked with Diane on the SENSEVAL project, where six experienced
lexicographers tried to assign corpus examples to dictionary senses.
There were substantial disagreements. I suspect the same would be true
if you asked six teachers to correct the same learner texts. I have done
a considerable amount of editing/correcting of native-speaker and
non-native-speaker English, from draft academic papers by very senior
colleagues, to PhD theses, to undergraduate essays, and there are very
few 'errors' (even what you might consider to be "gross mechanical
errors of learner English") for which I couldn't suggest alternative
corrections, depending on where the error is deemed to be located.
Talking to my "correctees" afterwards usually makes me realize that I
quite frequently misread their intentions, and therefore offered the
wrong corrections/suggestions.
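
For what it's worth, the usual way to quantify that sort of disagreement
is a chance-corrected agreement score such as Cohen's kappa over the
labels assigned to the same items. A minimal sketch in Python, with
invented sense labels for two annotators, just to show the arithmetic:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # Chance-corrected agreement between two annotators' label sequences
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        # Probability that both annotators pick the same label by chance
        expected = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in freq_a)
        return (observed - expected) / (1 - expected)

    # Invented sense assignments for ten corpus lines by two annotators
    annotator_1 = ["s1", "s1", "s2", "s3", "s1", "s2", "s2", "s1", "s3", "s2"]
    annotator_2 = ["s1", "s2", "s2", "s3", "s1", "s1", "s2", "s1", "s2", "s2"]
    print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.52, well short of perfect
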
Best
Ramesh


Pete Whitelock wrote:
> Re: papers written by English language learners *and the corrected
> versions with interference problems removed.*
> I don't know about removal of interference problems in their entirety,
> but since it's perfectly possible for a teacher to correct the gross
> mechanical errors of learner English (wrong or missing articles, wrong
> participles, etc.), it would be possible for an annotator to do the
> same. And in fact, CUP have 11m words of Cambridge Suite Examination
> (KET, PET, FCE, CAE, CPE) scripts marked up in this fashion, mostly by
> Diane Nicholls, and there's no cause for Ramesh to be disturbed by it.
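>
> Purely as an illustration of what such mark-up might look like (an
> invented scheme, not the actual CUP one): the annotator wraps each
> error in a tag carrying an error-type code and a suggested correction,
> and the annotations can then be counted or searched mechanically, e.g.
> in Python:
>
>     import xml.etree.ElementTree as ET
>     from collections import Counter
>
>     # Invented error codes: WA = wrong article, WV = wrong verb form
>     script = """<text>He is <err type="WA" corr="an">a</err> honest man and he
>     <err type="WV" corr="gave">gived</err> me good advice.</text>"""
>
>     root = ET.fromstring(script)
>     print(Counter(err.get("type") for err in root.iter("err")))
>     # Counter({'WA': 1, 'WV': 1})
>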
> Pete
>
>     ------------------------------------------------------------------------
>     *From:* owner-corpora at lists.uib.no
>     [mailto:owner-corpora at lists.uib.no] *On Behalf Of *Ramesh
>     Krishnamurthy
>     *Sent:* 10 November 2006 23:41
>     *To:* Merle Tenney; CORPORA at UIB.NO
>     *Subject:* RE: [Corpora-List] Parallel corpora and word alignment,
>     WAS: American and British English spelling converter
>
>     Hi Merle,
>
>     Yes, I was aware of parallel corpora in 2 or more languages.
>     In fact, it's part of the corpus development we've initiated at Aston
>     (please see http://corpus.aston.ac.uk).
>
>     But it intrigued me to think of parallel corpora *within* a language.
>     I suppose dialectal texts rendered into "standard" language or
>     vice versa
>     might come close... I need to muse some more on this.
>
>>     Another variant on the parallel corpus theme is papers written by
>>     English language learners *and the corrected versions with
>>     interference problems removed.*
>     I'm not sure how this could be done without making huge intuitive
>     leaps as to what the 'errors' were,
>     and what the 'interference problems' were... I'm afraid a lot of
>     the error analysis I've seen leaves me
>     greatly disturbed....
>
>     Best
>     Ramesh
>
>
>     At 23:09 10/11/2006, Merle Tenney wrote:
>>     Ramesh,
>>
>>     Lots of people are working with parallel corpora in two or more
>>     languages. Honestly, I don’t know of any effort to acquire
>>     parallel corpora of two or more varieties of English, French,
>>     Portuguese, etc. I should think that sources for such corpora
>>     must exist, though not nearly to the extent that they exist for
>>     texts in different languages. Another variant on the parallel
>>     corpus theme is papers written by English language learners and
>>     the corrected versions with interference problems removed. Again,
>>     it is not hard to imagine that such sources exist, but I cannot
>>     provide a reference for either sort of same-language corpus. Can
>>     someone point Ramesh and me in the right direction?
>>
>>     Merle
>>
>>     *From:* Ramesh Krishnamurthy [ mailto:r.krishnamurthy at aston.ac.uk]
>>     *Sent:* Friday, November 10, 2006 6:46 AM
>>     *To:* Merle Tenney; Mark P. Line; CORPORA at UIB.NO
>>     *Subject:* Re: [Corpora-List] Parallel corpora and word
>>     alignment, WAS: American and British English spelling converter
>>
>>     Hi Merle
>>     I must admit I hadn't been thinking of "parallel" corpora along
>>     such strict-definition lines.
>>
>>     So who is creating large amounts of 'parallel' data (in the
>>     technical/translation sense)
>>     for British English and American English? I wouldn't have thought
>>     there was a very large
>>     market...?
>>
>>     Noah Smith mentioned Harry Potter, and I must admit I'm quite
>>     surprised to discover
>>     that publishers are making such changes as:
>>
>>     They had drawn for the house cup
>>     They had tied for the house cup
>>
>>     Perhaps because it's "children's" literature? Or at least read by
>>     many children,
>>     who may not be willing/able to cross varietal boundaries with
>>     total comfort.
>>
>>     But when I read a novel by an American author, I accept that it's
>>     part of my role as reader to
>>     take on board any varietal differences as part of the context. I
>>     can't imagine anyone wanting
>>     to translate it into British English for my benefit, and I
>>     suspect I would hate to read the resulting
>>     text...
>>
>>     Best
>>     Ramesh
>>
>>
>>     At 18:53 09/11/2006, Merle Tenney wrote:
>>
>>     Ramesh Krishnamurthy wrote:
>>     >
>>     > ...and there is no obvious parallel corpus of Br-Am Eng to
>>     consult...
>>     > Do you know of one by any chance...
>>     >
>>     > And Mark P. Line responded:
>>     >
>>     >Why would it have to be a *parallel* corpus?
>>
>>     In a word, alignment. The formative work in parallel corpora has
>>     come from the machine translation crowd, especially the
>>     statistical machine translation researchers. The primary purpose
>>     of having a
>>     parallel corpus is to align translationally equivalent documents
>>     in two languages, first at the sentence level, then at the word
>>     and phrase level, in order to establish word and phrase
>>     equivalences. A secondary purpose, deriving from the
>>     sentence-level alignment, is to produce source and target
>>     sentence pairs to prime the pump for translation memory systems.
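>>
>>     Just to make the word-level step concrete, here is a minimal
>>     sketch in Python over an invented toy bitext (real systems use far
>>     more sophisticated statistical models): once the sentences are
>>     aligned, simple co-occurrence counts already give a crude
>>     word-equivalence score, here the Dice coefficient.
>>
>>         from collections import Counter
>>         from itertools import product
>>
>>         # Toy sentence-aligned English/Spanish pairs (a real bitext has millions)
>>         pairs = [
>>             ("the flag is red", "la bandera es roja"),
>>             ("the house is red", "la casa es roja"),
>>             ("the flag of spain", "la bandera de espana"),
>>         ]
>>
>>         count_src, count_tgt, count_pair = Counter(), Counter(), Counter()
>>         for src, tgt in pairs:
>>             src_words, tgt_words = set(src.split()), set(tgt.split())
>>             count_src.update(src_words)
>>             count_tgt.update(tgt_words)
>>             count_pair.update(product(src_words, tgt_words))
>>
>>         def dice(e, f):
>>             # How strongly e and f co-occur across the aligned sentence pairs
>>             return 2 * count_pair[(e, f)] / (count_src[e] + count_tgt[f])
>>
>>         # 'bandera' comes out as the strongest partner for 'flag'
>>         print(max(count_tgt, key=lambda f: dice("flag", f)))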
>>
>>     Like you, I have wondered why you couldn't study two text corpora
>>     of similar but not equivalent texts and compare them in their
>>     totality. Of course you can, but is there any way in this
>>     scenario to come up with meaningful term-level comparisons, as
>>     good as you can get with parallel corpora? I can see two ways you
>>     might proceed:
>>
>>     The first method largely begs the question of term equivalence.
>>     You begin with a set of known related words and you compare their
>>     frequencies and distributions. So if you are studying language
>>     models, you compare /sheer/, /complete/, and /utter/ as a group.
>>     If you are studying dialect differences, you study /diaper/ and
>>     /nappy/ or /bonnet/ and /hood/ (clothing and automotive). If you
>>     are studying translation equivalence in English and Spanish, you
>>     study /flag/, /banner/, /standard/, /pendant/ alongside
>>     /bandera/, /estandarte/, /pabellón/ (and /flag/, /flagstone/ vs.
>>     /losa/, /lancha/; /flag/, /fail,/ /languish/, /weaken/ vs.
>>     /flaquear/, /debilitarse/, /languidecer/; etc.). The point is,
>>     you already have your comparable sets going in, and you study
>>     their usage across a broad corpus. One problem here is that you
>>     need to have a strong word sense disambiguation component or you
>>     need to work with a word sense-tagged corpus to deal with
>>     homophonous and polysemous terms like /sheer/, /bonnet/, /flat/,
>>     and /flag/, so you still have some hard work left even if you
>>     start with the related word groups.
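>>
>>     A bare-bones sketch of this first method in Python (the file names
>>     below are placeholders for BrE and AmE corpora, and it does
>>     nothing about the disambiguation problem just mentioned): count
>>     each member of a known set in each corpus and compare the rates
>>     per million tokens.
>>
>>         import re
>>         from collections import Counter
>>
>>         def per_million(text, words):
>>             # Relative frequency (per million tokens) of each candidate word
>>             tokens = re.findall(r"[a-z']+", text.lower())
>>             counts = Counter(tokens)
>>             return {w: 1_000_000 * counts[w] / len(tokens) for w in words}
>>
>>         # Placeholder files standing in for large BrE and AmE corpora
>>         bre = open("bre_corpus.txt").read()
>>         ame = open("ame_corpus.txt").read()
>>
>>         for pair in [("nappy", "diaper"), ("bonnet", "hood")]:
>>             print(pair, per_million(bre, pair), per_million(ame, pair))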
>>
>>     The second method does not begin, a priori, with sets of related
>>     words. In fact, generating synonyms, dialectal variants, and
>>     translation equivalents is one of its more interesting
>>     challenges. Detailed lexical, collocational, and syntactic
>>     characterization is another. Again, this is much easier to do if
>>     you are working with parallel corpora. If you are dealing with
>>     large, nonparallel texts, this is a real challenge. Other than
>>     inflected and lemmatized word forms, there are a few more hooks
>>     that can be applied, including POS tagging and WSD. Even if both
>>     of these technologies perform well, however, that is still not
>>     enough to get you to the quality of data that you get with
>>     parallel corpora.
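>>
>>     One crude but concrete way to approach this without parallel text,
>>     sketched in Python (the file names are placeholders, and this is
>>     only one of several possible techniques): represent each word by a
>>     vector of the context words it co-occurs with in its own corpus,
>>     then rank words from the other corpus by cosine similarity to
>>     propose candidate equivalents.
>>
>>         import math, re
>>         from collections import Counter, defaultdict
>>
>>         def context_vectors(text, window=3):
>>             # Map each word to a bag of the words seen within +/- window positions
>>             tokens = re.findall(r"[a-z']+", text.lower())
>>             vectors = defaultdict(Counter)
>>             for i, w in enumerate(tokens):
>>                 vectors[w].update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
>>             return vectors
>>
>>         def cosine(u, v):
>>             num = sum(u[x] * v[x] for x in set(u) & set(v))
>>             den = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
>>             return num / den if den else 0.0
>>
>>         # Placeholder corpora of the two varieties
>>         vec_bre = context_vectors(open("bre_corpus.txt").read())
>>         vec_ame = context_vectors(open("ame_corpus.txt").read())
>>
>>         # Candidate AmE counterparts for BrE 'nappy', best first
>>         ranked = sorted(vec_ame, key=lambda w: cosine(vec_bre["nappy"], vec_ame[w]), reverse=True)
>>         print(ranked[:10])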
>>
>>     Mark, if you can figure out a way to combine the quality and
>>     quantity of data from a very large corpus with the alignment and
>>     equivalence power of a parallel corpus without actually having a
>>     parallel corpus, I will personally nominate you for the Nobel
>>     Prize in Corpus Linguistics. :-)
>>
>>     Merle
>>
>>     PS and Shameless Microsoft Plug: In the last paragraph, I
>>     accidentally typed “figure out a why to combine” and I got the
>>     blue squiggle from Word 2007, which was released to manufacturing
>>     on Monday of this week. It suggested /way/, and of course I took
>>     the suggestion. I am amazed at the number of mistakes that the
>>     contextual speller has caught in my writing since I started using
>>     it. I recommend the new version of Word and Office for this
>>     feature alone. :-)
>>
>
>     Ramesh Krishnamurthy
>
>     Lecturer in English Studies, School of Languages and Social
>     Sciences, Aston University, Birmingham B4 7ET, UK
>     [Room NX08, North Wing of Main Building] ; Tel: +44
>     (0)121-204-3812 ; Fax: +44 (0)121-204-3766
>     http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
>
>     Project Leader, ACORN (Aston Corpus Network):
>     http://corpus.aston.ac.uk/
>


