[Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

TadPiotr tadpiotr at plusnet.pl
Fri Nov 10 14:57:22 UTC 2006


Hello All
those of us who deal with speech might also be interested to know that there
are different American and British audio tracks on movies on DVD. (There is
a version of Zorro with Anthony Hopkins, and I was wondering whether he did
both versions.) I have no idea whether the differences are only in
pronunciation or perhaps also lexical or of other kinds.
However, that means there is quite a lot of material waiting to be
described.
Best wishes
Tadeusz Piotrowski


  _____  

From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Ramesh Krishnamurthy
Sent: Friday, November 10, 2006 3:46 PM
To: Merle Tenney; Mark P. Line; CORPORA at UIB.NO
Subject: Re: [Corpora-List] Parallel corpora and word alignment, WAS:
American and British English spelling converter


Hi Merle
I must admit I hadn't been thinking of "parallel" corpora along such
strict-definition lines.

So who is creating large amounts of 'parallel' data (in the
technical/translation sense) for British English and American English?
I wouldn't have thought there was a very large market....?

Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised to
discover that publishers are making such changes as


   They had drawn for the house cup
   They had tied for the house cup

Perhaps because it's "children's" literature? Or at least read by many
children, who may not be willing/able to cross varietal boundaries with
total comfort.

But when I read a novel by an American author, I accept that it's part of my
role as reader to take on board any varietal differences as part of the
context. I can't imagine anyone wanting to translate it into British English
for my benefit, and I suspect I would hate to read the resulting text...

Best
Ramesh


At 18:53 09/11/2006, Merle Tenney wrote:


Ramesh Krishnamurthy wrote:
> ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> Do you know of one by any chance...

And Mark P. Line responded:
> Why would it have to be a *parallel* corpus?
 
In a word, alignment.  The formative work in parallel corpora has come from
the machine translation crowd, especially the statistical machine
translation researchers.  The primary purpose of having a parallel corpus is
to align
translationally equivalent documents in two languages, first at the sentence
level, then at the word and phrase level, in order to establish word and
phrase equivalences.  A secondary purpose, deriving from the sentence-level
alignment, is to produce source and target sentence pairs to prime the pump
for translation memory systems.
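
To make the word-level step concrete for the Br-Am case, here is a minimal
Python sketch. It assumes the sentences are already aligned one-to-one, and
because both sides are English it swaps in a plain token diff (difflib from
the standard library) in place of the statistical word alignment that MT
systems use. The function name variant_pairs is made up for illustration;
the only data is the Harry Potter pair quoted earlier in this thread.

```python
from collections import Counter
from difflib import SequenceMatcher


def variant_pairs(british, american):
    """Return (british_phrase, american_phrase) pairs where two
    already-aligned sentences differ. Hypothetical helper, not a
    statistical aligner: it only diffs the two token sequences."""
    br = british.lower().split()
    am = american.lower().split()
    matcher = SequenceMatcher(a=br, b=am, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a span of tokens substituted for another span
        if tag == "replace":
            pairs.append((" ".join(br[i1:i2]), " ".join(am[j1:j2])))
    return pairs


# Sentence pair quoted in this thread (British edition, American edition)
aligned = [
    ("They had drawn for the house cup", "They had tied for the house cup"),
]

counts = Counter()
for br_sent, am_sent in aligned:
    counts.update(variant_pairs(br_sent, am_sent))

for (br_term, am_term), n in counts.most_common():
    print(br_term, "->", am_term, n)
```

Counting the pairs over many aligned sentences is what would turn one-off
differences into candidate equivalences. A diff like this only works when
the two texts are nearly identical, which is exactly what a true parallel
corpus gives you.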
 
Like you, I have wondered why you couldn't study two text corpora of similar
but not equivalent texts and compare them in their totality.  Of course you
can, but is there any way in this scenario to come up with meaningful
term-level comparisons, as good as you can get with parallel corpora?  I can
see two ways you might proceed:
 
The first method largely begs the question of term equivalence.  You begin
with a set of known related words and you compare their frequencies and
distributions.  So if you are studying language models, you compare sheer,
complete, and utter as a group.  If you are studying dialect differences,
you study diaper and nappy or bonnet and hood (clothing and automotive).  If
you are studying translation equivalence in English and Spanish, you study
flag, banner, standard, pendant alongside bandera, estandarte, pabellón (and
flag, flagstone vs. losa, lancha; flag, fail, languish, weaken vs. flaquear,
debilitarse, languidecer; etc.).  The point is, you already have your
comparable sets going in, and you study their usage across a broad corpus.
One problem here is that you need to have a strong word sense disambiguation
component or you need to work with a word sense-tagged corpus to deal with
homonymous and polysemous terms like sheer, bonnet, flat, and flag, so you
still have some hard work left even if you start with the related word
groups.
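
A minimal Python sketch of this first method, under loud assumptions: the
two "corpora" are single invented sentences standing in for real British and
American collections, the term list is fixed in advance, and relative_freqs
is a made-up helper name. It does no word sense disambiguation at all, so
every sense of bonnet or hood is counted together, which is the problem
described above.

```python
import re
from collections import Counter


def relative_freqs(text, terms):
    """Relative frequency of each known term in a text.
    Hypothetical helper; counts surface forms only, with no WSD."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {term: counts[term] / total for term in terms}


# Invented toy texts, standing in for two large comparable corpora
british_text = "She changed the nappy and then opened the bonnet of the car."
american_text = "She changed the diaper and then opened the hood of the car."

# The comparable sets are known going in
terms = ["nappy", "diaper", "bonnet", "hood"]

br = relative_freqs(british_text, terms)
am = relative_freqs(american_text, terms)

for term in terms:
    print(f"{term:8s} Br={br[term]:.3f} Am={am[term]:.3f}")
```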
 
The second method does not begin, a priori, with sets of related words.  In
fact, generating synonyms, dialectal variants, and translation equivalents
is one of its more interesting challenges.  Detailed lexical, collocational,
and syntactic characterization is another.  Again, this is much easier to
do if you are working with parallel corpora.  If you are dealing with large,
nonparallel texts, this is a real challenge.  Other than inflected and
lemmatized word forms, there are a few more hooks that can be applied,
including POS tagging and WSD.  Even if both of these technologies perform
well, however, that is still not enough to get you to the quality of data
that you get with parallel corpora.
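
For what it is worth, one common hook for nonparallel texts is to compare
the contexts words occur in. The Python sketch below is only an illustration
under stated assumptions: the two texts are invented and deliberately built
so the contexts match, the helper names context_vector and cosine are made
up, and raw word forms are used with no lemmatization, POS tagging, or WSD.
On real nonparallel corpora the signal would be far noisier.

```python
import math
import re
from collections import Counter


def context_vector(text, target, window=2):
    """Count the words appearing within `window` tokens of `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            vec.update(t for t in tokens[lo:hi] if t != target)
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Invented toy texts; a real study would need very large corpora
british_text = ("The baby needs a clean nappy now. "
                "Lift the bonnet and check the engine.")
american_text = ("The baby needs a clean diaper now. "
                 "Lift the hood and check the engine.")

nappy = context_vector(british_text, "nappy")
for candidate in ("diaper", "hood"):
    score = cosine(nappy, context_vector(american_text, candidate))
    print("nappy ~", candidate, round(score, 2))
```

The ranking depends entirely on shared context words, so it proposes
candidates but cannot confirm that two terms are equivalent.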
 
Mark, if you can figure out a way to combine the quality and quantity of
data from a very large corpus with the alignment and equivalence power of a
parallel corpus without actually having a parallel corpus, I will personally
nominate you for the Nobel Prize in Corpus Linguistics.  :-)
 
Merle
 
PS and Shameless Microsoft Plug:  In the last paragraph, I accidentally
typed “figure out a why to combine” and I got the blue squiggle from Word
2007, which was released to manufacturing on Monday of this week.  It
suggested way, and of course I took the suggestion.  I am amazed at the
number of mistakes that the contextual speller has caught in my writing
since I started using it.  I recommend the new version of Word and Office
for this feature alone.  :-)

Ramesh Krishnamurthy

Lecturer in English Studies, School of Languages and Social Sciences, Aston
University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ; Fax:
+44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/ 
