<html>

<body>

Hi Merle<br>

I must admit I hadn't been thinking of "parallel" corpora along

such strict-definition lines.<br><br>

So who is creating large amounts of 'parallel' data (in the

technical/translation sense)<br>

for British English and American English? I wouldn't have thought there

was a very large <br>

market....?<br><br>

Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised

to discover <br>

that publishers are making such changes as<br>

<blockquote type=cite class=cite cite="">   They had drawn for

the house cup<br>

   They had tied for the house cup</blockquote>Perhaps because

it's "children's" literature? Or at least read by many

children, <br>

who may not be willing/able to cross varietal boundaries with total

comfort.<br><br>

But when I read a novel by an American author, I accept that it's part of

my role as reader to <br>

take on board any varietal differences as part of the context. I can't

imagine anyone wanting<br>

to translate it into British English for my benefit, and I suspect I

would hate to read the resulting <br>

text...<br><br>

Best<br>

Ramesh<br><br>

<br>

At 18:53 09/11/2006, Merle Tenney wrote:<br>

<blockquote type=cite class=cite cite="">Ramesh Krishnamurthy wrote:<br>

> <br>

> ...and there is no obvious parallel corpus of Br-Am Eng to

consult...<br>

> Do you know of one by any chance...<br>

> <br>

> And Mark P. Line responded:<br>

> <br>

>Why would it have to be a *parallel* corpus?<br>

 <br>

In a word, alignment.  The formative work in parallel corpora has

come from the machine translation crowd, especially the statistical

machine translation researchers.  The primary purpose of having a parallel

corpus is to align translationally equivalent documents in two languages,

first at the sentence level, then at the word and phrase level, in order

to establish word and phrase equivalences.  A secondary purpose,

deriving from the sentence-level alignment, is to produce source and

target sentence pairs to prime the pump for translation memory

systems.<br>
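 <br>
[Editorial sketch] For two varieties of the same language, the word-level step can be roughly illustrated with Python's standard <i>difflib</i>, assuming the sentences have already been paired. The function name word_differences is hypothetical, and the sentence pair is the drawn/tied example from this thread. This is only a stand-in for real statistical word alignment, which a true translation pair would need.<br>

```python
from difflib import SequenceMatcher


def word_differences(source, target):
    """Return (source_words, target_words) pairs where two paired sentences differ."""
    src = source.split()
    tgt = target.split()
    # autojunk=False keeps short, repetitive sentences from being treated as noise
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return pairs


british = "They had drawn for the house cup"
american = "They had tied for the house cup"
print(word_differences(british, american))  # [('drawn', 'tied')]
```

Because both sentences share almost all of their words, a plain sequence diff is enough here; it would not work across genuinely different languages.<br>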

 <br>

Like you, I have wondered why you couldn't study two text corpora of

similar but not equivalent texts and compare them in their

totality.  Of course you can, but is there any way in this scenario

to come up with meaningful term-level comparisons, as good as you can get

with parallel corpora?  I can see two ways you might proceed:<br>

 <br>

The first method largely begs the question of term equivalence.  You

begin with a set of known related words and you compare their frequencies

and distributions.  So if you are studying language models, you

compare <i>sheer</i>, <i>complete</i>, and <i>utter</i> as a group. 

If you are studying dialect differences, you study <i>diaper</i> and

<i>nappy</i> or <i>bonnet</i> and <i>hood</i> (clothing and

automotive).  If you are studying translation equivalence in English

and Spanish, you study <i>flag</i>, <i>banner</i>, <i>standard</i>,

<i>pendant</i> alongside <i>bandera</i>, <i>estandarte</i>,

<i>pabellón</i> (and <i>flag</i>, <i>flagstone</i> vs. <i>losa</i>,

<i>lancha</i>; <i>flag</i>, <i>fail</i>, <i>languish</i>, <i>weaken</i>

vs. <i>flaquear</i>, <i>debilitarse</i>, <i>languidecer</i>; etc.). 

The point is, you already have your comparable sets going in, and you

study their usage across a broad corpus.  One problem here is that

you need to have a strong word sense disambiguation component or you need

to work with a word sense-tagged corpus to deal with homonymous and

polysemous terms like <i>sheer</i>, <i>bonnet</i>, <i>flat</i>, and

<i>flag</i>, so you still have some hard work left even if you start with

the related word groups.<br>
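 <br>
[Editorial sketch] The first method can be roughly illustrated as follows, assuming two plain-text English corpora and a hand-picked list of related words. The function name per_million and the two toy sentences are hypothetical stand-ins for real corpora. Note that the raw counts do nothing to separate senses, so <i>bonnet</i> as clothing and <i>bonnet</i> as a car part are counted together, which is exactly the disambiguation problem described above.<br>

```python
import re
from collections import Counter


def per_million(text, terms):
    """Frequency of each term per million tokens in a plain-text corpus."""
    # Crude tokenizer: lowercase ASCII letter runs only (an assumption for English text)
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {term: counts[term] * 1_000_000 / total for term in terms}


# Toy stand-ins for two large, nonparallel corpora
british_text = "She changed the nappy and then lifted the bonnet of the car"
american_text = "She changed the diaper and then lifted the hood of the car"

terms = ["nappy", "diaper", "bonnet", "hood"]
british = per_million(british_text, terms)
american = per_million(american_text, terms)

for term in terms:
    print(term, round(british[term]), round(american[term]))
```

Normalizing to a per-million rate is one common way to compare corpora of different sizes; other measures could be swapped in.<br>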

 <br>

The second method does not begin, a priori, with sets of related

words.  In fact, generating synonyms, dialectal variants, and

translation equivalents is one of its more interesting challenges. 

Detailed lexical, collocational, and syntactic characterization is

another.  Again, this is much easier to do if you are working with

parallel corpora.  If you are dealing with large, nonparallel texts,

this is a real challenge.  Other than inflected and lemmatized word

forms, there are a few more hooks that can be applied, including POS

tagging and WSD.  Even if both of these technologies perform well,

however, that is still not enough to get you to the quality of data that

you get with parallel corpora.<br>

 <br>

Mark, if you can figure out a way to combine the quality and quantity of

data from a very large corpus with the alignment and equivalence power of

a parallel corpus without actually having a parallel corpus, I will

personally nominate you for the Nobel Prize in Corpus Linguistics. 

:-)<br>

 <br>

Merle<br>

 <br>

PS and Shameless Microsoft Plug:  In the last paragraph, I

accidentally typed “figure out a why to combine” and I got the blue

squiggle from Word 2007, which was released to manufacturing on Monday of

this week.  It suggested <i>way</i>, and of course I took the

suggestion.  I am amazed at the number of mistakes that the

contextual speller has caught in my writing since I started using

it.  I recommend the new version of Word and Office for this feature

alone.  :-)</blockquote>

<x-sigsep><p></x-sigsep>

Ramesh Krishnamurthy<br><br>

Lecturer in English Studies, School of Languages and Social Sciences,

Aston University, Birmingham B4 7ET, UK<br>

[Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ;

Fax: +44 (0)121-204-3766<br>

<a href="http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp" eudora="autourl">

http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp<br><br>

</a>Project Leader, ACORN (Aston Corpus Network):

<a href="http://corpus.aston.ac.uk/" eudora="autourl">

http://corpus.aston.ac.uk/</a></body>

</html>