<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>

<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">

<meta name=Generator content="Microsoft Word 12 (filtered medium)">

<style>

<!--

 /* Font Definitions */

 @font-face

        {font-family:Wingdings;

        panose-1:5 0 0 0 0 0 0 0 0 0;}

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Verdana;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

 /* Style Definitions */

 p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri","sans-serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p.MsoPlainText, li.MsoPlainText, div.MsoPlainText

        {mso-style-priority:99;

        mso-style-link:"Plain Text Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:10.0pt;

        font-family:"Verdana","sans-serif";}

span.PlainTextChar

        {mso-style-name:"Plain Text Char";

        mso-style-priority:99;

        mso-style-link:"Plain Text";

        font-family:"Verdana","sans-serif";}

.MsoChpDefault

        {mso-style-type:export-only;}

@page Section1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.Section1

        {page:Section1;}

-->

</style>


</head>


<body lang=EN-US link=blue vlink=purple>


<div class=Section1>


<p class=MsoPlainText>Ramesh Krishnamurthy wrote:<o:p></o:p></p>


<p class=MsoPlainText>><o:p> </o:p></p>


<p class=MsoPlainText>> ...and there is no obvious parallel corpus of Br-Am

Eng to consult...<o:p></o:p></p>


<p class=MsoPlainText>> Do you know of one by any chance...<o:p></o:p></p>


<p class=MsoPlainText>><o:p> </o:p></p>


<p class=MsoPlainText>> And Mark P. Line responded:<o:p></o:p></p>


<p class=MsoPlainText><span style='color:black'>><o:p> </o:p></span></p>


<p class=MsoPlainText>>Why would it have to be a *parallel* corpus?<o:p></o:p></p>


<p class=MsoPlainText><o:p> </o:p></p>


<p class=MsoPlainText><span style='color:black'>In a word, alignment.  The formative

work in parallel corpora has come from the machine translation crowd,

especially the statistical machine translation researchers.  The primary purpose of having

a parallel corpus is to align translationally equivalent documents in two

languages, first at the sentence level, then at the word and phrase level, in

order to establish word and phrase equivalences.  A secondary purpose, deriving

from the sentence-level alignment, is to produce source and target sentence

pairs to prime the pump for translation memory systems.<o:p></o:p></span></p>


<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>Like you, I have wondered why you

couldn't study two text corpora of similar but not equivalent texts and compare

them in their totality.  Of course you can, but is there any way in this

scenario to come up with meaningful term-level comparisons that are as good as those you can

get with parallel corpora?  I can see two ways you might proceed:<o:p></o:p></span></p>


<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>The first method largely begs

the question of term equivalence.  You begin with a set of known related words

and you compare their frequencies and distributions.  So if you are studying

language models, you compare <i>sheer</i>, <i>complete</i>, and <i>utter</i> as

a group.  If you are studying dialect differences, you study <i>diaper</i> and <i>nappy</i>

or <i>bonnet</i> and <i>hood</i> (clothing and automotive).  If you are

studying translation equivalence in English and Spanish, you study <i>flag</i>,

<i>banner</i>, <i>standard</i>, <i>pennant</i> alongside <i>bandera</i>, <i>estandarte</i>,

<i>pabellón</i> (and <i>flag</i>, <i>flagstone</i> vs. <i>losa</i>, <i>lancha</i>;

<i>flag</i>, <i>fail</i>, <i>languish</i>, <i>weaken</i> vs. <i>flaquear</i>, <i>debilitarse</i>,

<i>languidecer</i>; etc.).  The point is, you already have your comparable sets

going in, and you study their usage across a broad corpus.  One problem here is

that you need to have a strong word sense disambiguation component or you need

to work with a word sense-tagged corpus to deal with homonymous and polysemous

terms like <i>sheer</i>, <i>bonnet</i>, <i>flat</i>, and <i>flag</i>, so you

still have some hard work left even if you start with the related word groups.<o:p></o:p></span></p>
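
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>The first method can be sketched
in a few lines of Python.  This is only a toy illustration under stated
assumptions, not a description of any existing tool: the two sample texts, the
word set, and the helper names tokenize and per_thousand are all made up for
the example.  It counts each word of a predefined set in a British and an
American sample and reports the frequency per 1,000 tokens.<o:p></o:p></span></p>


<pre>
```python
import re
from collections import Counter

# Hypothetical toy samples; real work would use large British and American corpora.
BRITISH = "She changed the nappy, then lifted the bonnet of the car. The nappy bag was full."
AMERICAN = "He changed the diaper, then popped the hood of the car. Her hood was up in the rain."

# Related word set chosen in advance (the known related words of the first method).
WORD_SET = ["nappy", "diaper", "bonnet", "hood"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def per_thousand(text, words):
    """Return each word's frequency per 1,000 tokens of the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000.0 * counts[w] / total for w in words}


british = per_thousand(BRITISH, WORD_SET)
american = per_thousand(AMERICAN, WORD_SET)

# Print the two normalized frequencies side by side for each word in the set.
for word in WORD_SET:
    print(f"{word:8s} Br {british[word]:7.1f}  Am {american[word]:7.1f}")
```
</pre>


<p class=MsoPlainText><span style='color:black'>Note that the count for <i>hood</i>
in the American sample lumps the automotive and clothing senses together, which
is exactly where a word sense disambiguation step or a sense-tagged corpus
would have to come in.<o:p></o:p></span></p>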


<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>The second method does not

begin, a priori, with sets of related words.  In fact, generating synonyms,

dialectal variants, and translation equivalents is one of its more interesting challenges. 

Detailed lexical, collocational, and syntactic characterization is another. 

Again, this is much easier to do if you are working with parallel corpora.  If

you are dealing with large, nonparallel texts, this is a real challenge.  Other

than inflected and lemmatized word forms, there are a few more hooks that can

be applied, including part-of-speech (POS) tagging and word sense disambiguation (WSD).  Even if both of these technologies

perform well, however, that is still not enough to get you to the quality of

data that you get with parallel corpora.<o:p></o:p></span></p>


<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>Mark, if you can figure out a way

to combine the quality and quantity of data from a very large corpus with the alignment

and equivalence power of a parallel corpus without actually having a parallel

corpus, I will personally nominate you for the Nobel Prize in Corpus

Linguistics. :-)<o:p></o:p></span></p>


<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>Merle<o:p></o:p></span></p>


<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>


<p class=MsoPlainText><span style='color:black'>PS and Shameless Microsoft Plug: 

In the last paragraph, I accidentally typed “figure out a why to combine”

and I got the blue squiggle from Word 2007, which was released to manufacturing

on Monday of this week.  It suggested <i>way</i>, and of course I took the

suggestion.  I am amazed at the number of mistakes that the contextual speller

has caught in my writing since I started using it.  I recommend the new version

of Word and Office for this feature alone. :-)<o:p></o:p></span></p>


</div>


</body>


</html>