[Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

Fri Nov 10 23:51:31 UTC 2006

Ramesh,

Actually, there have been a lot of studies of language learner errors.  Many of the errors are not that much of a reach or subject to researcher interpretation: use of "actually" instead of "currently", use of "depend of" instead of "depend on", use of "informations" instead of "information", etc.  A corpus which took a strict view of learner errors and associated those errors with correct native forms, via a parallel corpus of corrected texts or a rich tagging scheme, would be very useful for studying interference errors.
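As a rough illustration of how a parallel corpus of learner and corrected texts could associate errors with native forms, here is a minimal sketch using Python's standard difflib.  It assumes the learner sentence and its correction are already paired one to one; the function name extract_corrections and the example sentences are made up for illustration and do not come from any existing learner corpus.

```python
from difflib import SequenceMatcher

def extract_corrections(learner: str, corrected: str):
    """Return (learner_span, corrected_span) pairs where the two sentences differ.

    Hypothetical helper: assumes whitespace tokenization is good enough
    and that the two sentences are already paired.
    """
    src = learner.split()
    tgt = corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep every replaced, inserted, or deleted token span.
            pairs.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return pairs

learner = "It will depend of the informations we receive"
corrected = "It will depend on the information we receive"
pairs = extract_corrections(learner, corrected)
print(pairs)
```

This only recovers what was changed, not why; deciding which changes are interference errors would still need annotation.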

Merle

From: Ramesh Krishnamurthy [mailto:r.krishnamurthy at aston.ac.uk]
Sent: Friday, November 10, 2006 3:41 PM
To: Merle Tenney; CORPORA at UIB.NO
Subject: RE: [Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

Hi Merle,

Yes, I was aware of parallel corpora in 2 or more languages.
In fact, it's part of the corpus development we've initiated at Aston
(please see http://corpus.aston.ac.uk).

But it intrigued me to think of parallel corpora *within* a language.
I suppose dialectal texts rendered into "standard" language or vice versa
might come close... I need to muse some more on this.

Another variant on the parallel corpus theme is papers written by English language learners and the corrected versions with interference problems removed.
I'm not sure how this could be done without making huge intuitive leaps as to what the 'errors' were,
and what the 'interference problems' were... I'm afraid a lot of the error analysis I've seen leaves me
greatly disturbed....

Best
Ramesh

At 23:09 10/11/2006, Merle Tenney wrote:

Ramesh,

Lots of people are working with parallel corpora in two or more languages.  Honestly, I don't know of any effort to acquire parallel corpora of two or more varieties of English, French, Portuguese, etc.  I should think that sources for such corpora must exist, though not nearly to the extent that they exist for texts in different languages.  Another variant on the parallel corpus theme is papers written by English language learners and the corrected versions with interference problems removed.  Again, it is not hard to imagine that such sources exist, but I cannot provide a reference for either sort of same-language corpus.  Can someone point Ramesh and me in the right direction?

Merle

From: Ramesh Krishnamurthy [mailto:r.krishnamurthy at aston.ac.uk]
Sent: Friday, November 10, 2006 6:46 AM
To: Merle Tenney; Mark P. Line; CORPORA at UIB.NO
Subject: Re: [Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

Hi Merle
I must admit I hadn't been thinking of "parallel" corpora along such strict-definition lines.

So who is creating large amounts of 'parallel' data (in the technical/translation sense)
for British English and American English? I wouldn't have thought there was a very large
market....?

Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised to discover
that publishers are making such changes as

   They had drawn for the house cup
   They had tied for the house cup
Perhaps because it's "children's" literature? Or at least read by many children,
who may not be willing/able to cross varietal boundaries with total comfort.

But when I read a novel by an American author, I accept that it's part of my role as reader to
take on board any varietal differences as part of the context. I can't imagine anyone wanting
to translate it into British English for my benefit, and I suspect I would hate to read the resulting
text...

Best
Ramesh

At 18:53 09/11/2006, Merle Tenney wrote:

Ramesh Krishnamurthy wrote:

> ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> Do you know of one by any chance...

And Mark P. Line responded:

> Why would it have to be a *parallel* corpus?

In a word, alignment.  The formative work in parallel corpora has come from the machine translation crowd, especially the statistical machine translation researchers.  The primary purpose of having a parallel corpus is to align translationally equivalent documents in two languages, first at the sentence level, then at the word and phrase level, in order to establish word and phrase equivalences.  A secondary purpose, deriving from the sentence-level alignment, is to produce source and target sentence pairs to prime the pump for translation memory systems.
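To make the step from sentence-level to word-level alignment concrete, here is a minimal sketch that scores word pairs by the Dice coefficient over sentence pairs that are assumed to be aligned already.  This is a simple co-occurrence heuristic, not the alignment models that statistical MT systems actually use, and the toy British/American sentence pairs and the function names are invented for illustration.

```python
from collections import Counter
from itertools import product

def dice_scores(sentence_pairs):
    """Score every (source word, target word) pair by the Dice coefficient.

    Assumes the sentence pairs are already aligned one to one.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count each word pair once per aligned sentence pair.
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }

def best_match(word, scores):
    """Return the target word with the highest Dice score for `word`."""
    candidates = {t: sc for (s, t), sc in scores.items() if s == word}
    return max(candidates, key=candidates.get)

# Invented toy data: British English on the left, American on the right.
pairs = [
    ("the baby needs a clean nappy", "the baby needs a clean diaper"),
    ("open the bonnet of the car", "open the hood of the car"),
    ("she bought a new nappy", "she bought a new diaper"),
    ("a bonnet was dented", "a hood was dented"),
]
scores = dice_scores(pairs)
print(best_match("nappy", scores), best_match("bonnet", scores))
```

With only four sentence pairs the scores are fragile; the point is just that sentence alignment is what makes the co-occurrence counts meaningful at all.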

Like you, I have wondered why you couldn't study two text corpora of similar but not equivalent texts and compare them in their totality.  Of course you can, but is there any way in this scenario to come up with meaningful term-level comparisons, as good as you can get with parallel corpora?  I can see two ways you might proceed:

The first method largely begs the question of term equivalence.  You begin with a set of known related words and you compare their frequencies and distributions.  So if you are studying language models, you compare sheer, complete, and utter as a group.  If you are studying dialect differences, you study diaper and nappy or bonnet and hood (clothing and automotive).  If you are studying translation equivalence in English and Spanish, you study flag, banner, standard, pendant alongside bandera, estandarte, pabellón (and flag, flagstone vs. losa, lancha; flag, fail, languish, weaken vs. flaquear, debilitarse, languidecer; etc.).  The point is, you already have your comparable sets going in, and you study their usage across a broad corpus.  One problem here is that you need to have a strong word sense disambiguation component or you need to work with a word sense-tagged corpus to deal with homophonous and polysemous terms like sheer, bonnet, flat, and flag, so you still have some hard work left even if you start with the related word groups.
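A minimal sketch of this first method, assuming you already have the related word set in hand and two plain-text corpora to compare.  The function name relative_freqs, the regular-expression tokenizer, and the tiny sample texts are all illustrative assumptions.  Note that it counts surface forms only, so "bonnet" the hat and "bonnet" the car part are conflated, which is exactly the word sense problem described above.

```python
import re
from collections import Counter

def relative_freqs(text, terms, per=1_000_000):
    """Frequency of each term per `per` tokens of `text` (surface forms only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {t: 0.0 for t in terms}
    counts = Counter(tokens)
    return {t: counts[t] * per / total for t in terms}

terms = ["nappy", "diaper", "bonnet", "hood"]

# Invented stand-ins for two large, nonparallel corpora.
british = "the nappy and the bonnet"
american = "the diaper and the hood"

br = relative_freqs(british, terms, per=100)
am = relative_freqs(american, terms, per=100)
for t in terms:
    print(t, br[t], am[t])
```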

The second method does not begin, a priori, with sets of related words.  In fact, generating synonyms, dialectal variants, and translation equivalents is one of its more interesting challenges.  Detailed lexical, collocational, and syntactic characterization is another.  Again, this is much easier to do if you are working with parallel corpora.  If you are dealing with large, nonparallel texts, this is a real challenge.  Other than inflected and lemmatized word forms, there are a few more hooks that can be applied, including POS tagging and WSD.  Even if both of these technologies perform well, however, that is still not enough to get you to the quality of data that you get with parallel corpora.
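One hook that works on nonparallel texts is a keyness statistic.  The sketch below uses the log-likelihood ratio commonly applied in corpus comparison to rank words that are distinctive of one corpus or the other.  This is a swapped-in illustration rather than anything proposed in this thread, and the function names and toy token lists are assumptions.  It shows the limitation as well: it can surface "nappy" and "diaper" as distinctive, but it does nothing to pair them as equivalents.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total token counts of corpus 1 and corpus 2
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(tokens1, tokens2, top=5):
    """Rank words by how unevenly they are distributed across two corpora."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    scored = {
        w: log_likelihood(c1[w], c2[w], n1, n2)
        for w in set(c1) | set(c2)
    }
    return sorted(scored, key=scored.get, reverse=True)[:top]

# Invented toy corpora.
tokens1 = "the nappy the nappy the car".split()
tokens2 = "the diaper the diaper the car".split()
print(keywords(tokens1, tokens2, top=2))
```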

Mark, if you can figure out a way to combine the quality and quantity of data from a very large corpus with the alignment and equivalence power of a parallel corpus without actually having a parallel corpus, I will personally nominate you for the Nobel Prize in Corpus Linguistics.  :-)

Merle

PS and Shameless Microsoft Plug:  In the last paragraph, I accidentally typed "figure out a why to combine" and I got the blue squiggle from Word 2007, which was released to manufacturing on Monday of this week.  It suggested "way", and of course I took the suggestion.  I am amazed at the number of mistakes that the contextual speller has caught in my writing since I started using it.  I recommend the new version of Word and Office for this feature alone.  :-)

Ramesh Krishnamurthy

Lecturer in English Studies, School of Languages and Social Sciences, Aston University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/
