[Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

Ramesh Krishnamurthy r.krishnamurthy at aston.ac.uk
Fri Nov 10 14:45:52 UTC 2006


Hi Merle
I must admit I hadn't been thinking of "parallel" corpora along such strict-definition lines.

So who is creating large amounts of 'parallel' data (in the technical/translation sense)
for British English and American English? I wouldn't have thought there was a very large
market...?

Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised to discover
that publishers are making such changes as
>    They had drawn for the house cup
>    They had tied for the house cup
Perhaps because it's "children's" literature? Or at least read by many children,
who may not be willing/able to cross varietal boundaries with total comfort.

But when I read a novel by an American author, I accept that it's part of my role
as reader to take on board any varietal differences as part of the context. I can't
imagine anyone wanting to translate it into British English for my benefit, and I
suspect I would hate to read the resulting text...

Best
Ramesh


At 18:53 09/11/2006, Merle Tenney wrote:
>Ramesh Krishnamurthy wrote:
> >
> > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> > Do you know of one by any chance...
> >
>And Mark P. Line responded:
> >
> >Why would it have to be a *parallel* corpus?
>
>In a word, alignment.  The formative work in 
>parallel corpora has come from the machine 
>translation crowd, especially statistical 
>machine translation researchers.  The primary purpose of 
>having a parallel corpus is to align 
>translationally equivalent documents in two 
>languages, first at the sentence level, then at 
>the word and phrase level, in order to establish 
>word and phrase equivalences.  A secondary 
>purpose, deriving from the sentence-level 
>alignment, is to produce source and target 
>sentence pairs to prime the pump for translation memory systems.
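>
>As a purely illustrative sketch of the word-level step, here is a
>toy version of IBM Model 1 in Python (the Br/Am sentence pairs are
>invented for the example, not drawn from any real corpus):
>
>    # Toy IBM Model 1: estimate word-translation probabilities
>    # t(f|e) from sentence pairs by EM.
>    from collections import defaultdict
>
>    pairs = [
>        ("they had drawn for the house cup".split(),
>         "they had tied for the house cup".split()),
>        ("the teams drew level".split(),
>         "the teams tied level".split()),
>    ]
>
>    t = defaultdict(lambda: 1.0)          # uniform start; the scale
>                                          # washes out in the M-step
>    for _ in range(10):                   # EM iterations
>        count = defaultdict(float)
>        total = defaultdict(float)
>        for e_sent, f_sent in pairs:
>            for f in f_sent:              # E-step: fractional counts
>                norm = sum(t[(f, e)] for e in e_sent)
>                for e in e_sent:
>                    c = t[(f, e)] / norm
>                    count[(f, e)] += c
>                    total[e] += c
>        for (f, e), c in count.items():   # M-step: renormalise
>            t[(f, e)] = c / total[e]
>
>    # The shared words anchor one another, so after a few iterations
>    # t[("tied", "drawn")] and t[("tied", "drew")] stand out: the
>    # model has "aligned" the varietal substitution.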
>
>Like you, I have wondered why you couldn't study 
>two text corpora of similar but not equivalent 
>texts and compare them in their totality.  Of 
>course you can, but is there any way in this 
>scenario to come up with meaningful term-level 
>comparisons, as good as you can get with 
>parallel corpora?  I can see two ways you might proceed:
>
>The first method largely begs the question of 
>term equivalence.  You begin with a set of known 
>related words and you compare their frequencies 
>and distributions.  So if you are studying 
>language models, you compare sheer, complete, 
>and utter as a group.  If you are studying 
>dialect differences, you study diaper and nappy 
>or bonnet and hood (clothing and 
>automotive).  If you are studying translation 
>equivalence in English and Spanish, you study 
>flag, banner, standard, pendant alongside 
>bandera, estandarte, pabellón (and flag, 
>flagstone vs. losa, lancha; flag, fail, 
>languish, weaken vs. flaquear, debilitarse, 
>languidecer; etc.).  The point is, you already 
>have your comparable sets going in, and you 
>study their usage across a broad corpus.  One 
>problem here is that you need to have a strong 
>word sense disambiguation component or you need 
>to work with a word sense-tagged corpus to deal 
>with homophonous and polysemous terms like 
>sheer, bonnet, flat, and flag, so you still have 
>some hard work left even if you start with the related word groups.
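>
>(A minimal sketch of this first method in Python, assuming a crude
>regex tokenisation; the two corpus file names and the word group
>are placeholders:)
>
>    # Compare relative frequencies of a known variant group
>    # across a British and an American corpus.
>    import re
>    from collections import Counter
>
>    def rel_freqs(path, words):
>        text = open(path, encoding="utf-8").read().lower()
>        tokens = re.findall(r"[a-z']+", text)
>        counts = Counter(tokens)
>        return {w: counts[w] / len(tokens) for w in words}
>
>    group = ["diaper", "nappy", "bonnet", "hood"]
>    print("BrE:", rel_freqs("bre_corpus.txt", group))
>    print("AmE:", rel_freqs("ame_corpus.txt", group))
>
>(Raw counts like these of course lump together every sense of
>bonnet and hood, which is exactly where the WSD problem above bites.)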
>
>The second method does not begin, a priori, with 
>sets of related words.  In fact, generating 
>synonyms, dialectal variants, and translation 
>equivalents is one of its more interesting 
>challenges.  Detailed lexical, collocational, 
>and syntactic characterization is 
>another.  Again, this is much easier to do if 
>you are working with parallel corpora.  If you 
>are dealing with large, nonparallel texts, this 
>is a real challenge.  Other than inflected and 
>lemmatized word forms, there are a few more 
>hooks that can be applied, including POS tagging 
>and WSD.  Even if both of these technologies 
>perform well, however, that is still not enough 
>to get you to the quality of data that you get with parallel corpora.
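>
>(One concrete, if crude, way to attack the nonparallel case, sketched
>in Python with invented file names: compare words' distributional
>profiles, i.e. bag-of-context-words vectors, across the two corpora.
>This is plain distributional similarity rather than the POS/WSD hooks
>just mentioned, and its output is correspondingly noisy:)
>
>    # Rank candidate equivalents for a word by cosine similarity
>    # between co-occurrence vectors built from each corpus.
>    import math, re
>    from collections import Counter
>
>    def context_vectors(path, window=3):
>        toks = re.findall(r"[a-z']+",
>                          open(path, encoding="utf-8").read().lower())
>        vecs = {}
>        for i, w in enumerate(toks):
>            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
>            vecs.setdefault(w, Counter()).update(ctx)
>        return vecs
>
>    def cosine(u, v):
>        num = sum(u[w] * v[w] for w in set(u) & set(v))
>        den = (math.sqrt(sum(c * c for c in u.values()))
>               * math.sqrt(sum(c * c for c in v.values())))
>        return num / den if den else 0.0
>
>    bre = context_vectors("bre_corpus.txt")
>    ame = context_vectors("ame_corpus.txt")
>    target = bre["nappy"]
>    ranked = sorted(ame, key=lambda w: cosine(target, ame[w]),
>                    reverse=True)
>    print(ranked[:10])      # one would hope to see "diaper" high up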
>
>Mark, if you can figure out a way to combine the 
>quality and quantity of data from a very large 
>corpus with the alignment and equivalence power 
>of a parallel corpus without actually having a 
>parallel corpus, I will personally nominate you 
>for the Nobel Prize in Corpus Linguistics.  :-)
>
>Merle
>
>PS and Shameless Microsoft Plug:  In the last 
>paragraph, I accidentally typed “figure out a 
>why to combine” and I got the blue squiggle from 
>Word 2007, which was released to manufacturing 
>on Monday of this week.  It suggested “way”, and 
>of course I took the suggestion.  I am amazed at 
>the number of mistakes that the contextual 
>speller has caught in my writing since I started 
>using it.  I recommend the new version of Word 
>and Office for this feature alone.  :-)

Ramesh Krishnamurthy

Lecturer in English Studies, School of Languages 
and Social Sciences, Aston University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building] ; Tel: 
+44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/ 