<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:p="urn:schemas-microsoft-com:office:powerpoint" xmlns:a="urn:schemas-microsoft-com:office:access" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" xmlns:b="urn:schemas-microsoft-com:office:publisher" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:oa="urn:schemas-microsoft-com:office:activation" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:q="http://schemas.xmlsoap.org/soap/envelope/" xmlns:D="DAV:" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:ois="http://schemas.microsoft.com/sharepoint/soap/ois/" xmlns:dir="http://schemas.microsoft.com/sharepoint/soap/directory/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sps="http://schemas.microsoft.com/sharepoint/soap/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmlfile" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns:ex12t="http://schemas.microsoft.com/exchange/services/2006/types" xmlns="http://www.w3.org/TR/REC-html40">


<head>

<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">

<meta name=Generator content="Microsoft Word 12 (filtered medium)">

<style>

<!--

 /* Font Definitions */

 @font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

 /* Style Definitions */

 p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p

        {mso-style-priority:99;

        mso-margin-top-alt:auto;

        margin-right:0in;

        mso-margin-bottom-alt:auto;

        margin-left:0in;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

span.EmailStyle18

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page Section1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.Section1

        {page:Section1;}

-->

</style>


</head>


<body lang=EN-US link=blue vlink=purple>


<div class=Section1>


<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>Ramesh,<o:p></o:p></span></p>


<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p> </o:p></span></p>


<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>Lots of people are working with parallel corpora in two or more

languages.  Honestly, I don’t know of any effort to acquire parallel

corpora of two or more varieties of English, French, Portuguese, etc.  I should

think that sources for such corpora must exist, though not nearly to the extent

that they exist for texts in different languages.  Another variant on the

parallel corpus theme is papers written by English language learners and the

corrected versions with interference problems removed.  Again, it is not hard

to imagine that such sources exist, but I cannot provide a reference for either

sort of same-language corpus.  Can someone point Ramesh and me in the right

direction?<o:p></o:p></span></p>


<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p> </o:p></span></p>


<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>Merle<o:p></o:p></span></p>


<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p> </o:p></span></p>


<div>


<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>


<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span

style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> Ramesh

Krishnamurthy [mailto:r.krishnamurthy@aston.ac.uk] <br>

<b>Sent:</b> Friday, November 10, 2006 6:46 AM<br>

<b>To:</b> Merle Tenney; Mark P. Line; CORPORA@UIB.NO<br>

<b>Subject:</b> Re: [Corpora-List] Parallel corpora and word alignment, WAS:

American and British English spelling converter<o:p></o:p></span></p>


</div>


</div>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>Hi Merle<br>

I must admit I hadn't been thinking of "parallel" corpora along such

strict-definition lines.<br>

<br>

So who is creating large amounts of 'parallel' data (in the technical/translation

sense)<br>

for British English and American English? I wouldn't have thought there was a

very large <br>

market....?<br>

<br>

Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised to

discover <br>

that publishers are making such changes as<br>

<br>

<o:p></o:p></p>


<p class=MsoNormal>   They had drawn for the house cup<br>

   They had tied for the house cup<o:p></o:p></p>


<p class=MsoNormal>Perhaps because it's "children's" literature? Or

at least read by many children, <br>

who may not be willing/able to cross varietal boundaries with total comfort.<br>

<br>

But when I read a novel by an American author, I accept that it's part of my

role as reader to <br>

take on board any varietal differences as part of the context. I can't imagine

anyone wanting<br>

to translate it into British English for my benefit, and I suspect I would hate

to read the resulting <br>

text...<br>

<br>

Best<br>

Ramesh<br>

<br>

<br>

At 18:53 09/11/2006, Merle Tenney wrote:<br>

<br>

<o:p></o:p></p>


<p class=MsoNormal>Ramesh Krishnamurthy wrote:<br>

> <br>

> ...and there is no obvious parallel corpus of Br-Am Eng to consult...<br>

> Do you know of one by any chance...<br>

> <br>

> And Mark P. Line responded:<br>

> <br>

> Why would it have to be a *parallel* corpus?<br>

 <br>

In a word, alignment.  The formative work in parallel corpora has come

from the machine translation crowd, especially the statistical machine

translation researchers.  The primary purpose of having a parallel corpus is to align

translationally equivalent documents in two languages, first at the sentence

level, then at the word and phrase level, in order to establish word and phrase

equivalences.  A secondary purpose, deriving from the sentence-level

alignment, is to produce source and target sentence pairs to prime the pump for

translation memory systems.<br>
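<br>
As a rough illustration of the word-level step, here is a minimal Python sketch that assumes the two editions are already aligned sentence by sentence. The function name variant_pairs and the use of difflib are illustrative assumptions, not what statistical MT aligners actually do; a simple diff like this only works when the two varieties share most of their words, as in the Harry Potter example quoted in this thread.<br>
<br>

```python
from difflib import SequenceMatcher


def variant_pairs(sentence_a, sentence_b):
    """Return (words_a, words_b) substitutions between two aligned sentences.

    Hypothetical helper: assumes the sentences are already known to be
    translationally equivalent and differ only in a few words.
    """
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a stretch of words that differs between editions
        if tag == "replace":
            pairs.append((" ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2])))
    return pairs


# Sentence pair taken from the example quoted earlier in this thread
aligned = [
    ("They had drawn for the house cup", "They had tied for the house cup"),
]

for edition_a, edition_b in aligned:
    print(variant_pairs(edition_a, edition_b))
```

Collected over many aligned sentence pairs, substitutions like these would give a first list of candidate equivalences to check by hand.<br>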

 <br>

Like you, I have wondered why you couldn't study two text corpora of similar

but not equivalent texts and compare them in their totality.  Of course

you can, but is there any way in this scenario to come up with meaningful

term-level comparisons, as good as you can get with parallel corpora?  I

can see two ways you might proceed:<br>

 <br>

The first method largely begs the question of term equivalence.  You begin

with a set of known related words and you compare their frequencies and

distributions.  So if you are studying language models, you compare <i>sheer</i>,

<i>complete</i>, and <i>utter </i>as a group.  If you are studying dialect

differences, you study <i>diaper</i> and <i>nappy</i> or <i>bonnet</i> and <i>hood</i>

(clothing and automotive).  If you are studying translation equivalence in

English and Spanish, you study <i>flag</i>, <i>banner</i>, <i>standard</i>, <i>pendant</i>

alongside <i>bandera</i>, <i>estandarte</i>, <i>pabellón</i> (and <i>flag</i>, <i>flagstone</i>

vs. <i>losa</i>, <i>lancha</i>; <i>flag</i>, <i>fail,</i> <i>languish</i>, <i>weaken</i>

vs. <i>flaquear</i>, <i>debilitarse</i>, <i>languidecer</i>; etc.).  The

point is, you already have your comparable sets going in, and you study their

usage across a broad corpus.  One problem here is that you need to have a

strong word sense disambiguation component or you need to work with a word

sense-tagged corpus to deal with homonymous and polysemous terms like <i>sheer</i>,

<i>bonnet</i>, <i>flat</i>, and <i>flag</i>, so you still have some hard work

left even if you start with the related word groups.<br>
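<br>
The first method can be sketched roughly as follows, assuming each corpus is already tokenized into a list of lowercase words. The function per_million and the toy sentences are made up for illustration only. Raw counts like these do not separate the clothing and automotive senses of bonnet or hood, which is exactly where the WSD work comes in.<br>
<br>

```python
from collections import Counter


def per_million(tokens, words):
    """Relative frequency (per million tokens) of each word of interest.

    Hypothetical helper: no lemmatization or sense disambiguation is done.
    """
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty corpus
    return {word: counts[word] * 1_000_000 / total for word in words}


# Toy data standing in for two large corpora (invented for this sketch)
corpus_a = "the baby needs a clean nappy and the car bonnet is open".split()
corpus_b = "the baby needs a clean diaper and the car hood is open".split()

# The comparable set is known in advance, as the first method requires
related_words = ["nappy", "diaper", "bonnet", "hood"]

freq_a = per_million(corpus_a, related_words)
freq_b = per_million(corpus_b, related_words)

for word in related_words:
    print(word, round(freq_a[word]), round(freq_b[word]))
```

With real corpora the same comparison would be run over millions of tokens, and the per-million normalization is what makes corpora of different sizes comparable.<br>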

 <br>

The second method does not begin, a priori, with sets of related words. 

In fact, generating synonyms, dialectal variants, and translation equivalents

is one of its more interesting challenges.  Detailed lexical,

collocational, and syntactic characterization is another.  Again, this is

much easier to do if you are working with parallel corpora.  If you are

dealing with large, nonparallel texts, this is a real challenge.  Other

than inflected and lemmatized word forms, there are a few more hooks that can

be applied, including POS tagging and WSD.  Even if both of these

technologies perform well, however, that is still not enough to get you to the

quality of data that you get with parallel corpora.<br>

 <br>

Mark, if you can figure out a way to combine the quality and quantity of data

from a very large corpus with the alignment and equivalence power of a parallel

corpus without actually having a parallel corpus, I will personally nominate

you for the Nobel Prize in Corpus Linguistics.  :-)<br>

 <br>

Merle<br>

 <br>

PS and Shameless Microsoft Plug:  In the last paragraph, I accidentally typed

“figure out a why to combine” and I got the blue squiggle from Word

2007, which was released to manufacturing on Monday of this week.  It

suggested <i>way</i>, and of course I took the suggestion.  I am amazed at

the number of mistakes that the contextual speller has caught in my writing

since I started using it.  I recommend the new version of Word and Office

for this feature alone.  :-)<o:p></o:p></p>


<p>Ramesh Krishnamurthy<br>

<br>

Lecturer in English Studies, School of Languages and Social Sciences, Aston

University, Birmingham B4 7ET, UK<br>

[Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ; Fax: +44

(0)121-204-3766<br>

<a href="http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp">http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp</a><br>

<br>

Project Leader, ACORN (Aston Corpus Network): <a

href="http://corpus.aston.ac.uk/">http://corpus.aston.ac.uk/</a><o:p></o:p></p>


</div>


</body>


</html>