[Corpora-List] "errors and the art of correcting"

TadPiotr tadpiotr at plusnet.pl
Sat Nov 11 20:50:43 UTC 2006


A collection of corpora along those lines -- native vs non-native English --
has been compiled by Sylviane Granger. At least the Polish sub-corpus
contained texts that were later corrected by native speakers. The errors
were analysed by Przemek Kaszubski in his PhD. Here are some quotations
and links:

"One of the major international collections built on strict sampling
principles is the International Corpus of Learner English (ICLE), which
contains argumentative essays acquired from learners in more than a dozen
different EFL countries in Europe and beyond. Although the ICLE corpus is
not yet available to the public, research on it has been carried out for
years."
Przemek Kaszubski http://www.hltmag.co.uk/dec99/idea.htm

The Louvain Centre for English Corpus Linguistics has played a pioneering
role in promoting computer learner corpora (CLC) and was among the first, if
not the first, to begin compiling such a corpus. The Centre's computerised
databank is known as the International Corpus of Learner English (ICLE) and
is the result of over ten years of collaborative activity between a number
of universities internationally and currently contains over 2 million words
of writing by learners of English from 19 different mother tongue
backgrounds. The writing in the corpus has been contributed by advanced
learners of English as a foreign language rather than as a second language
and is made up of 19 distinct sub-corpora, each containing one language
variety (E2French, E2German, E2Swedish etc). The type of writing being
collected is essay writing (see below for fuller details). Advanced students
can, for the purpose of the project, be broadly defined as university
students of English in their 3rd or 4th year of study. In cases where the
comparability of the level is in doubt, sample pieces of writing should be
submitted beforehand.   
http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm#heading1
 
Best
Tadeusz Piotrowski

> -----Original Message-----
> From: owner-corpora at lists.uib.no 
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Ramesh Krishnamurthy
> Sent: Saturday, November 11, 2006 5:01 PM
> To: Pete Whitelock; CORPORA at UIB.NO
> Subject: Re: [Corpora-List] "errors and the art of correcting"
> 
> Hi Pete
> I worked with Diane on the SENSEVAL project, where six 
> experienced lexicographers tried to assign corpus examples to 
> dictionary senses. There were substantial disagreements. I 
> suspect the same would be true if you asked 6 teachers to 
> correct the same learner texts. I have done a considerable 
> amount of editing/correcting of native-speaker-English and 
> non-native-speaker English, from draft academic papers by 
> very senior colleagues, to PhD theses, to undergraduate 
> essays, and there are very few 'errors' (even what you might 
> consider to be "gross mechanical errors of learner English") 
> that I couldn't suggest alternative corrections for, 
> depending on where the error is deemed to be located.
> Talking to my "correctees" afterwards usually makes me 
> realize that I quite frequently misread their intentions, and 
> therefore offered the wrong corrections/suggestions.
> Best
> Ramesh
> 
> 
> Pete Whitelock wrote:
> > Re: papers written by English language learners *and the corrected 
> > versions with interference problems removed. * I don't know about 
> > removal of interference problems in their entirety, but since it's 
> > perfectly possible for a teacher to correct the gross mechanical 
> > errors of learner English (wrong or missing articles, wrong 
> > participles etc.), it would be possible for an annotator to do the 
> > same. And in fact, CUP have 11m words of Cambridge Suite Examination
> > (KET, PET, FCE, CAE, CPE) scripts marked up in this fashion, mostly by
> > Diane Nicholls, and there's no cause for Ramesh to be disturbed by it.
> > Pete
> >
> > ------------------------------------------------------------------------
> >     *From:* owner-corpora at lists.uib.no
> >     [mailto:owner-corpora at lists.uib.no] *On Behalf Of *Ramesh
> >     Krishnamurthy
> >     *Sent:* 10 November 2006 23:41
> >     *To:* Merle Tenney; CORPORA at UIB.NO
> >     *Subject:* RE: [Corpora-List] Parallel corpora and word alignment,
> >     WAS: American and British English spelling converter
> >
> >     Hi Merle,
> >
> >     Yes, I was aware of parallel corpora in 2 or more languages.
> >     In fact, it's part of the corpus development we've initiated at Aston
> >     (please see http://corpus.aston.ac.uk).
> >
> >     But it intrigued me to think of parallel corpora *within* a language.
> >     I suppose dialectal texts rendered into "standard" language or
> >     vice versa might come close... I need to muse some more on this.
> >
> >>     Another variant on the parallel corpus theme is papers written by
> >>     English language learners *and the corrected versions with
> >>     interference problems removed. *
> >     I'm not sure how this could be done without making huge intuitive
> >     leaps as to what the 'errors' were, and what the 'interference
> >     problems' were... I'm afraid a lot of the error analysis I've seen
> >     leaves me greatly disturbed....
> >
> >     Best
> >     Ramesh
> >
> >
> >     At 23:09 10/11/2006, Merle Tenney wrote:
> >>     Ramesh,
> >>
> >>     Lots of people are working with parallel corpora in two or more
> >>     languages. Honestly, I don’t know of any effort to acquire
> >>     parallel corpora of two or more varieties of English, French,
> >>     Portuguese, etc. I should think that sources for such corpora
> >>     must exist, though not nearly to the extent that they exist for
> >>     texts in different languages. Another variant on the parallel
> >>     corpus theme is papers written by English language learners and
> >>     the corrected versions with interference problems removed. Again,
> >>     it is not hard to imagine that such sources exist, but I cannot
> >>     provide a reference for either sort of same-language corpus. Can
> >>     someone point Ramesh and me in the right direction?
> >>
> >>     Merle
> >>
> >>     *From:* Ramesh Krishnamurthy [mailto:r.krishnamurthy at aston.ac.uk]
> >>     *Sent:* Friday, November 10, 2006 6:46 AM
> >>     *To:* Merle Tenney; Mark P. Line; CORPORA at UIB.NO
> >>     *Subject:* Re: [Corpora-List] Parallel corpora and word
> >>     alignment, WAS: American and British English spelling converter
> >>
> >>     Hi Merle
> >>     I must admit I hadn't been thinking of "parallel" corpora along
> >>     such strict-definition lines.
> >>
> >>     So who is creating large amounts of 'parallel' data (in the
> >>     technical/translation sense)
> >>     for British English and American English? I wouldn't have thought
> >>     there was a very large market....?
> >>
> >>     Noah Smith mentioned Harry Potter, and I must admit I'm quite
> >>     surprised to discover that publishers are making such changes as
> >>
> >>     They had drawn for the house cup
> >>     They had tied for the house cup
> >>
> >>     Perhaps because it's "children's" literature? Or at least read by
> >>     many children, who may not be willing/able to cross varietal
> >>     boundaries with total comfort.
> >>
> >>     But when I read a novel by an American author, I accept that it's
> >>     part of my role as reader to take on board any varietal differences
> >>     as part of the context. I can't imagine anyone wanting to translate
> >>     it into British English for my benefit, and I suspect I would hate
> >>     to read the resulting text...
> >>
> >>     Best
> >>     Ramesh
> >>
> >>
> >>     At 18:53 09/11/2006, Merle Tenney wrote:
> >>
> >>     Ramesh Krishnamurthy wrote:
> >>     >
> >>     > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> >>     > Do you know of one by any chance...
> >>     >
> >>     > And Mark P. Line responded:
> >>     >
> >>     > Why would it have to be a *parallel* corpus?
> >>
> >>     In a word, alignment. The formative work in parallel corpora has
> >>     come from the machine translation crowd, especially the
> >>     statistical machine translation researchers. The primary purpose
> >>     of having a parallel corpus is to align translationally equivalent
> >>     documents in two languages, first at the sentence level, then at
> >>     the word and phrase level, in order to establish word and phrase
> >>     equivalences. A secondary purpose, deriving from the
> >>     sentence-level alignment, is to produce source and target
> >>     sentence pairs to prime the pump for translation memory systems.
> >>
> >>     Like you, I have wondered why you couldn't study two text corpora
> >>     of similar but not equivalent texts and compare them in their
> >>     totality. Of course you can, but is there any way in this
> >>     scenario to come up with meaningful term-level comparisons, as
> >>     good as you can get with parallel corpora? I can see two ways you
> >>     might proceed:
> >>
> >>     The first method largely begs the question of term equivalence.
> >>     You begin with a set of known related words and you compare their
> >>     frequencies and distributions. So if you are studying language
> >>     models, you compare /sheer/, /complete/, and /utter/ as a group.
> >>     If you are studying dialect differences, you study /diaper/ and
> >>     /nappy/ or /bonnet/ and /hood/ (clothing and automotive). If you
> >>     are studying translation equivalence in English and Spanish, you
> >>     study /flag/, /banner/, /standard/, /pendant/ alongside
> >>     /bandera/, /estandarte/, /pabellón/ (and /flag/, /flagstone/ vs.
> >>     /losa/, /lancha/; /flag/, /fail/, /languish/, /weaken/ vs.
> >>     /flaquear/, /debilitarse/, /languidecer/; etc.). The point is,
> >>     you already have your comparable sets going in, and you study
> >>     their usage across a broad corpus. One problem here is that you
> >>     need to have a strong word sense disambiguation component or you
> >>     need to work with a word sense-tagged corpus to deal with
> >>     homophonous and polysemous terms like /sheer/, /bonnet/, /flat/,
> >>     and /flag/, so you still have some hard work left even if you
> >>     start with the related word groups.
> >>
> >>     The second method does not begin, a priori, with sets of related
> >>     words. In fact, generating synonyms, dialectal variants, and
> >>     translation equivalents is one of its more interesting
> >>     challenges. Detailed lexical, collocational, and syntactic
> >>     characterization is another. Again, this is much easier to do if
> >>     you are working with parallel corpora. If you are dealing with
> >>     large, nonparallel texts, this is a real challenge. Other than
> >>     inflected and lemmatized word forms, there are a few more hooks
> >>     that can be applied, including POS tagging and WSD. Even if both
> >>     of these technologies perform well, however, that is still not
> >>     enough to get you to the quality of data that you get with
> >>     parallel corpora.
> >>
> >>     Mark, if you can figure out a way to combine the quality and
> >>     quantity of data from a very large corpus with the alignment and
> >>     equivalence power of a parallel corpus without actually having a
> >>     parallel corpus, I will personally nominate you for the Nobel
> >>     Prize in Corpus Linguistics. :-)
> >>
> >>     Merle
> >>
> >>     PS and Shameless Microsoft Plug: In the last paragraph, I
> >>     accidentally typed “figure out a why to combine” and I got the
> >>     blue squiggle from Word 2007, which was released to manufacturing
> >>     on Monday of this week. It suggested /way/, and of course I took
> >>     the suggestion. I am amazed at the number of mistakes that the
> >>     contextual speller has caught in my writing since I started using
> >>     it. I recommend the new version of Word and Office for this
> >>     feature alone. :-)
> >>
> >>     Ramesh Krishnamurthy
> >>
> >>     Lecturer in English Studies, School of Languages and Social
> >>     Sciences, Aston University, Birmingham B4 7ET, UK
> >>     [Room NX08, North Wing of Main Building] ; Tel: +44
> >>     (0)121-204-3812 ; Fax: +44 (0)121-204-3766
> >>     http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
> >>
> >>     Project Leader, ACORN (Aston Corpus Network):
> >>     http://corpus.aston.ac.uk/
> >
> >
> 
> 
> 
> 


