[Corpora-List] RE: [Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

Santos Diana Diana.Santos at sintef.no
Thu Nov 16 14:39:28 UTC 2006


I just want to support Ramesh's claims: basically, that it is much easier to spot that there is a problem than to provide the one right correction.

We had exactly the same problem when trying to come up with a grid for human evaluation of machine translation, see e.g. Santos, Diana, Belinda Maia & Luís Sarmento. "Gathering empirical data to evaluate MT from English to Portuguese", Proceedings of LREC 2004 Satellite Workshop on the Amazing Utility of Parallel and Comparable Corpora (Lisboa, Portugal, 25 May 2004), pp. 14-17. http://www.linguateca.pt/Diana/download/SantosMaiaSarmentoAmazing2004.pdf

Actually, my conclusion was that it was better to have people suggest an alternative full translation than to have them try to correct, or point out precisely, what the error was.

To find the cause of the error -- and therefore correct it appropriately -- is in most cases extremely difficult, and, as Ramesh pointed out, different people come up with different solutions/corrections.

I also support him from the opposite side: there have been many cases when my texts in English were corrected by a native speaker and I concluded that they had been misunderstood, so I had to rewrite the corresponding sentences, hopefully in a clearer way, rather than simply accept the correction that was suggested.

I have the same experience when I correct texts in Portuguese, mainly those written by my students. When I propose radically different ways of expressing something, I often get the answer that this was not what they originally meant -- so this has nothing to do with native or non-native writing, but with the difficulty of univocal correction.

I wonder if anyone can point me to studies of the percentage of corrected sentences that are actually accepted in real environments, as opposed to being more or less rewritten once the correction itself has drawn attention to them?

Diana


> -----Original Message-----
> From: owner-corpora at lists.uib.no 
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Ramesh Krishnamurthy
> Sent: 11. november 2006 01:35
> To: Merle Tenney
> Cc: CORPORA at UIB.NO
> Subject: RE: [Corpora-List] Parallel corpora and word 
> alignment, WAS: American and British English spelling converter
> 
> Hi Merle
> Yes, again, I am reasonably aware of the work going on with 
> learner corpora.
> 
> >Many of the errors are not that much of a reach or subject to 
> >researcher interpretation - use of actually instead of currently, use 
> >of depend of instead of depend on, use of informations instead of 
> >information, etc.
> 
> But, to take just one example you mentioned, "accommodations" 
> seems to be fairly common in NAm English, according to the 
> Bank of English corpus (2002: 448m words), and shows some 
> penetration of BR English as well (and not always referring 
> to NAm contexts)...
> 
> Query is "accommodations"
> Term 1 in your query has been selected as the node
> 
> 1230 matching lines
> Corpus         Total Number of       Average Number per
>                 Occurrences           Million Words
> 
> usspok              220                108.7/million
> usephem             145                 41.4/million
> usbooks             429                 13.2/million
> strathy             166                 10.4/million
> usacad               37                  5.8/million
> usnews               45                  4.5/million
> npr                  54                  2.4/million
> wbe                  11                  1.1/million
> brbooks              40                  0.9/million
> indy                 20                  0.7/million
> brephem               3                  0.6/million
> bbc                   7                  0.4/million
> brspok                7                  0.3/million
> guard                11                  0.3/million
> brmags               15                  0.3/million
> oznews                9                  0.3/million
> times                 8                  0.2/million
> econ                  2                  0.1/million
> sunnow                1                  0.0/million
> newsci                0                  0.0/million
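
The "Average Number per Million Words" column above is a normalized frequency: raw hits scaled by the size of each subcorpus, so that subcorpora of very different sizes can be compared. A minimal sketch of that calculation follows. The subcorpus names and sizes are hypothetical placeholders, since the message does not give the actual Bank of English subcorpus sizes, and `per_million` is an illustrative helper, not part of any corpus tool.

```python
def per_million(occurrences: int, corpus_size_words: int) -> float:
    """Return occurrences per million words of running text."""
    if corpus_size_words <= 0:
        raise ValueError("corpus size must be a positive word count")
    return occurrences * 1_000_000 / corpus_size_words


# Hypothetical (hits, subcorpus size in words) pairs, for illustration only.
# These sizes are assumptions, not the real Bank of English figures.
subcorpora = {
    "variety_a_spoken": (220, 2_000_000),
    "variety_b_spoken": (7, 20_000_000),
}

# Rank subcorpora by normalized frequency rather than by raw hits.
ranked = sorted(
    subcorpora.items(),
    key=lambda item: per_million(*item[1]),
    reverse=True,
)
for name, (hits, size) in ranked:
    print(f"{name:<20}{hits:>6}{per_million(hits, size):>10.1f}/million")
```

Ranking by the normalized figure rather than the raw count is what lets a small subcorpus with relatively few hits appear above a much larger one.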
> 
> So we need to be careful not to mark this usage wrong in a 
> student's work, without examining the specific context, 
> target variety, etc etc
> 
> Here are a few selected examples (I presume you refer to the 
> 'place to live in' sense of accommodation, rather than the 
> 'making a compromise' sense, which is more commonly countable):
> 
> White Washingtonians did not lack means to discriminate 
> against their black fellow citizens before Wilson came to 
> town in 1913, but the first southerner to occupy the White 
> House since the Civil War did come with something new: the 
> South's system of separate-and-unequal public accommodations 
> and services that survived until it was dismantled by protest 
> movements and court decisions in the 1950s and 1960s.
> 
> Another irritation to European visitors was the absence of 
> special first-class accommodations on steamboats and railroads.
> 
> Congressman Newt Gingrich, a powerful voice against 
> corruption in the House, enjoyed 49 days on the road in lush 
> accommodations in 1992 at the expense of various interest 
> groups, according to House financial reports.
> 
> The bonus for Easterners: free dormitory accommodations.
> 
> A proforma on the screen then asks them for details of their 
> requirements, including neighbourhoods and price range.
> Immediately Gems will display a map of accommodations which 
> fit their requirements. These will have been supplied by 
> anyone with a room to hire who...
> 
> I have also seen many of the type of corrections you mention, 
> and sometimes the trigger for the error is signalled 
> elsewhere in the text, so the real cause is obscured.
> 
> >A corpus which took a strict view of learner errors and associated 
> >those errors with correct native forms
> I'm not sure what you mean by a "strict view" 
> (and surely it would be the observer and not the corpus which 
> took it), and in my experience there may be a variety of 
> "correct native forms" depending on where you perceive the 
> error to be located... e.g. in a case of mis-concord, do you 
> correct the number of the noun group or the form of the verb 
> group? It's not always straightforward. To take one simple example:
>
>     The 10-week course run from the middle of July
>
> Do you amend this to "courses" or to "runs"? What was the intention of
> the writer? To make a generic or specific statement?
> 
> I'm afraid I remain far from convinced that this is an easy task.
> 
> I agree that learner data can be very interesting and 
> rewarding to study, but I'm not sure that the inferences to 
> be made are at all obvious.
> 
> Best
> Ramesh
> 
> At 23:51 10/11/2006, you wrote:
> >Ramesh,
> >
> >Actually, there have been a lot of studies of language learner errors.  
> >Many of the errors are not that much of a reach or subject to 
> >researcher interpretation - use of actually instead of currently, use of 
> >depend of instead of depend on, use of informations instead of 
> >information, etc.  A corpus which took a strict view of learner errors 
> >and associated those errors with correct native forms, via a parallel 
> >corpus of corrected texts or a rich tagging scheme, would be very 
> >useful for studying interference errors.
> >
> >Merle
> >
> >From: Ramesh Krishnamurthy [mailto:r.krishnamurthy at aston.ac.uk]
> >Sent: Friday, November 10, 2006 3:41 PM
> >To: Merle Tenney; CORPORA at UIB.NO
> >Subject: RE: [Corpora-List] Parallel corpora and word 
> alignment, WAS: 
> >American and British English spelling converter
> >
> >Hi Merle,
> >
> >Yes, I was aware of parallel corpora in 2 or more languages.
> >In fact, it's part of the corpus development we've initiated at Aston 
> >(please see http://corpus.aston.ac.uk).
> >
> >But it intrigued me to think of parallel corpora *within* a language.
> >I suppose dialectal texts rendered into "standard" language or vice 
> >versa might come close... I need to muse some more on this.
> >
> >
> >Another variant on the parallel corpus theme is papers written by 
> >English language learners and the corrected versions with interference 
> >problems removed.
> >I'm not sure how this could be done without making huge intuitive leaps 
> >as to what the 'errors' were, and what the 'interference problems' 
> >were... I'm afraid a lot of the error analysis I've seen leaves me 
> >greatly disturbed....
> >
> >Best
> >Ramesh
> >
> >
> >At 23:09 10/11/2006, Merle Tenney wrote:
> >
> >Ramesh,
> >
> >Lots of people are working with parallel corpora in two or more 
> >languages.  Honestly, I don't know of any effort to acquire parallel 
> >corpora of two or more varieties of English, French, Portuguese, etc.  
> >I should think that sources for such corpora must exist, though not 
> >nearly to the extent that they exist for texts in different languages.  
> >Another variant on the parallel corpus theme is papers written by 
> >English language learners and the corrected versions with interference 
> >problems removed.  Again, it is not hard to imagine that such sources 
> >exist, but I cannot provide a reference for either sort of 
> >same-language corpus.  Can someone point Ramesh and me in the right 
> >direction?
> >
> >Merle
> >
> >From: Ramesh Krishnamurthy [ mailto:r.krishnamurthy at aston.ac.uk]
> >Sent: Friday, November 10, 2006 6:46 AM
> >To: Merle Tenney; Mark P. Line; CORPORA at UIB.NO
> >Subject: Re: [Corpora-List] Parallel corpora and word 
> alignment, WAS: 
> >American and British English spelling converter
> >
> >Hi Merle
> >I must admit I hadn't been thinking of "parallel" corpora along such 
> >strict-definition lines.
> >
> >So who is creating large amounts of 'parallel' 
> >data (in the technical/translation sense) for British English and 
> >American English? I wouldn't have thought there was a very large 
> >market....?
> >
> >Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised 
> >to discover that publishers are making such changes as
> >
> >    They had drawn for the house cup
> >    They had tied for the house cup
> >
> >Perhaps because it's "children's" literature? Or at least read by many 
> >children, who may not be willing/able to cross varietal boundaries with 
> >total comfort.
> >
> >But when I read a novel by an American author, I accept that it's part 
> >of my role as reader to take on board any varietal differences as part 
> >of the context. I can't imagine anyone wanting to translate it into 
> >British English for my benefit, and I suspect I would hate to read the 
> >resulting text...
> >
> >Best
> >Ramesh
> >
> >
> >At 18:53 09/11/2006, Merle Tenney wrote:
> >
> >Ramesh Krishnamurthy wrote:
> > >
> > > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> > > Do you know of one by any chance...
> > >
> > > And Mark P. Line responded:
> > >
> > >Why would it have to be a *parallel* corpus?
> >
> >In a word, alignment.  The formative work in parallel corpora has come 
> >from the machine translation crowd, especially the statistical machine 
> >translation researchers.  The primary purpose of having a parallel 
> >corpus is to align translationally equivalent documents in two 
> >languages, first at the sentence level, then at the word and phrase 
> >level, in order to establish word and phrase equivalences.  A secondary 
> >purpose, deriving from the sentence-level alignment, is to produce 
> >source and target sentence pairs to prime the pump for translation 
> >memory systems.
> >
> >Like you, I have wondered why you couldn't study two text corpora of 
> >similar but not equivalent texts and compare them in their totality.  
> >Of course you can, but is there any way in this scenario to come up 
> >with meaningful term-level comparisons, as good as you can get with 
> >parallel corpora?  I can see two ways you might proceed:
> >
> >The first method largely begs the question of term equivalence.  You 
> >begin with a set of known related words and you compare their 
> >frequencies and distributions.  So if you are studying language models, 
> >you compare sheer, complete, and utter as a group.  If you are studying 
> >dialect differences, you study diaper and nappy or bonnet and hood 
> >(clothing and automotive).  If you are studying translation equivalence 
> >in English and Spanish, you study flag, banner, standard, pendant 
> >alongside bandera, estandarte, pabellón (and flag, flagstone vs. losa, 
> >lancha; flag, fail, languish, weaken vs. flaquear, debilitarse, 
> >languidecer; etc.).  The point is, you already have your comparable 
> >sets going in, and you study their usage across a broad corpus.  One 
> >problem here is that you need to have a strong word sense 
> >disambiguation component or you need to work with a word sense-tagged 
> >corpus to deal with homophonous and polysemous terms like sheer, 
> >bonnet, flat, and flag, so you still have some hard work left even if 
> >you start with the related word groups.
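
The first method described above (start from a known set of related words and compare their frequencies across two non-parallel corpora) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two one-sentence "corpora" are invented toy texts, `tokenize` and `compare_terms` are hypothetical helper names, and the tokenizer is a crude letters-only regex. It performs no word sense disambiguation, which is exactly the limitation the paragraph warns about (it cannot tell the clothing sense of bonnet from the automotive one).

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Crude tokenizer: lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def compare_terms(corpus_a: str, corpus_b: str, terms: list[str]) -> dict[str, tuple[float, float]]:
    """Map each term to its (per-million in A, per-million in B) frequency."""
    counts_a = Counter(tokenize(corpus_a))
    counts_b = Counter(tokenize(corpus_b))
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    rates = {}
    for term in terms:
        rate_a = counts_a[term] * 1_000_000 / size_a if size_a else 0.0
        rate_b = counts_b[term] * 1_000_000 / size_b if size_b else 0.0
        rates[term] = (rate_a, rate_b)
    return rates


# Invented toy texts standing in for two large non-parallel corpora.
AMERICAN = "She changed the diaper and then opened the hood of the car."
BRITISH = "She changed the nappy and then opened the bonnet of the car."

rates = compare_terms(AMERICAN, BRITISH, ["diaper", "nappy", "hood", "bonnet"])
for term, (rate_a, rate_b) in rates.items():
    print(f"{term:<8}{rate_a:>12.1f}{rate_b:>12.1f}")
```

With real corpora the word sets would be chosen in advance, as the message says, and the counts would ideally come from sense-tagged text rather than raw word forms.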
> >
> >The second method does not begin, a priori, with sets of related words.  
> >In fact, generating synonyms, dialectal variants, and translation 
> >equivalents is one of its more interesting challenges.  Detailed 
> >lexical, collocational, and syntactic characterization is another.  
> >Again, this is much easier to do if you are working with parallel 
> >corpora.  If you are dealing with large, nonparallel texts, this is a 
> >real challenge.  Other than inflected and lemmatized word forms, there 
> >are a few more hooks that can be applied, including POS tagging and 
> >WSD.  Even if both of these technologies perform well, however, that is 
> >still not enough to get you to the quality of data that you get with 
> >parallel corpora.
> >
> >Mark, if you can figure out a way to combine the quality and quantity 
> >of data from a very large corpus with the alignment and equivalence 
> >power of a parallel corpus without actually having a parallel corpus, I 
> >will personally nominate you for the Nobel Prize in Corpus Linguistics.  
> >:-)
> >
> >Merle
> >
> >PS and Shameless Microsoft Plug:  In the last paragraph, I accidentally 
> >typed "figure out a why to combine" and I got the blue squiggle from 
> >Word 2007, which was released to manufacturing on Monday of this week.  
> >It suggested way, and of course I took the suggestion.  I am amazed at 
> >the number of mistakes that the contextual speller has caught in my 
> >writing since I started using it.  I recommend the new version of Word 
> >and Office for this feature alone.  :-)
> >
> >Ramesh Krishnamurthy
> >
> >Lecturer in English Studies, School of Languages and Social Sciences, 
> >Aston University, Birmingham B4 7ET, UK [Room NX08, North Wing of Main 
> >Building] ; Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
> >http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
> >
> >Project Leader, ACORN (Aston Corpus Network): 
> >http://corpus.aston.ac.uk/
> 
> 


