<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Verdana","sans-serif";}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:"Verdana","sans-serif";}
.MsoChpDefault
{mso-style-type:export-only;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
{page:Section1;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoPlainText>Ramesh Krishnamurthy wrote:<o:p></o:p></p>
<p class=MsoPlainText>><o:p> </o:p></p>
<p class=MsoPlainText>> ...and there is no obvious parallel corpus of Br-Am
Eng to consult...<o:p></o:p></p>
<p class=MsoPlainText>> Do you know of one by any chance...<o:p></o:p></p>
<p class=MsoPlainText>><o:p> </o:p></p>
<p class=MsoPlainText>> And Mark P. Line responded:<o:p></o:p></p>
<p class=MsoPlainText><span style='color:black'>><o:p> </o:p></span></p>
<p class=MsoPlainText>>Why would it have to be a *parallel* corpus?<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText><span style='color:black'>In a word, alignment. The formative
work in parallel corpora has come from the machine translation crowd,
especially the statistical machine translation researchers. The primary purpose of having
a parallel corpus is to align translationally equivalent documents in two
languages, first at the sentence level, then at the word and phrase level, in
order to establish word and phrase equivalences. A secondary purpose, deriving
from the sentence-level alignment, is to produce source and target sentence
pairs to prime the pump for translation memory systems.<o:p></o:p></span></p>
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>
<p class=MsoPlainText><span style='color:black'>Like you, I have wondered why you
couldn't study two text corpora of similar but not equivalent texts and compare
them in their totality. Of course you can, but is there any way in this
scenario to come up with meaningful term-level comparisons, as good as you can
get with parallel corpora? I can see two ways you might proceed:<o:p></o:p></span></p>
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>
<p class=MsoPlainText><span style='color:black'>The first method largely begs
the question of term equivalence. You begin with a set of known related words
and you compare their frequencies and distributions. So if you are studying
language models, you compare <i>sheer</i>, <i>complete</i>, and <i>utter </i>as
a group. If you are studying dialect differences, you study <i>diaper</i> and <i>nappy</i>
or <i>bonnet</i> and <i>hood</i> (clothing and automotive). If you are
studying translation equivalence in English and Spanish, you study <i>flag</i>,
<i>banner</i>, <i>standard</i>, <i>pennant</i> alongside <i>bandera</i>, <i>estandarte</i>,
<i>pabellón</i> (and <i>flag</i>, <i>flagstone</i> vs. <i>losa</i>, <i>lancha</i>;
<i>flag</i>, <i>fail</i>, <i>languish</i>, <i>weaken</i> vs. <i>flaquear</i>, <i>debilitarse</i>,
<i>languidecer</i>; etc.). The point is, you already have your comparable sets
going in, and you study their usage across a broad corpus. One problem here is
that you need to have a strong word sense disambiguation component or you need
to work with a word sense-tagged corpus to deal with homonymous and polysemous
terms like <i>sheer</i>, <i>bonnet</i>, <i>flat</i>, and <i>flag</i>, so you
still have some hard work left even if you start with the related word groups.<o:p></o:p></span></p>
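<p class=MsoPlainText><span style='color:black'>To make the first method concrete, here is a minimal sketch in Python. It assumes each corpus is already tokenized into a plain list of words and that the related word set is chosen by hand. The function name <i>per_million</i> and the two one-sentence "corpora" are invented for illustration only. It counts surface forms, so it does nothing about the word sense problem just described.<o:p></o:p></span></p>

```python
from collections import Counter

def per_million(tokens, words):
    """Occurrences per million tokens for each word in `words` (hypothetical helper)."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty corpus
    return {w: counts[w] * 1_000_000 / total for w in words}

# Toy stand-ins for a British and an American corpus (invented data).
british = "she put the nappy in the bag and closed the bonnet".split()
american = "she put the diaper in the bag and closed the hood".split()

# The comparable set is fixed going in, as the first method requires.
word_set = ["nappy", "diaper", "bonnet", "hood"]
br = per_million(british, word_set)
am = per_million(american, word_set)

for w in word_set:
    print(f"{w:8s} BrE {br[w]:10.0f}   AmE {am[w]:10.0f}")
```

<p class=MsoPlainText><span style='color:black'>Note that a count for <i>bonnet</i> or <i>hood</i> here lumps the clothing and automotive senses together, which is exactly why a sense-tagged corpus or a disambiguation step would still be needed.<o:p></o:p></span></p>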
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>
<p class=MsoPlainText><span style='color:black'>The second method does not
begin, a priori, with sets of related words. In fact, generating synonyms,
dialectal variants, and translation equivalents is one of its more interesting challenges.
Producing detailed lexical, collocational, and syntactic characterizations is another.
Again, this is much easier to do if you are working with parallel corpora. If
you are dealing with large, nonparallel texts, this is a real challenge. Other
than inflected and lemmatized word forms, there are a few more hooks that can
be applied, including POS tagging and WSD. Even if both of these technologies
perform well, however, that is still not enough to get you to the quality of
data that you get with parallel corpora.<o:p></o:p></span></p>
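<p class=MsoPlainText><span style='color:black'>One hook that might be tried on nonparallel corpora is a keyness comparison. The sketch below uses the log-likelihood (G2) statistic commonly used for comparing word frequencies across two corpora; it is a swapped-in illustration, not a method proposed in this message. The function names and toy data are assumptions. It ranks words that are distinctive of one corpus relative to the other, but it does not pair them up.<o:p></o:p></span></p>

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in corpus 1 (c tokens) and b times in corpus 2 (d tokens)."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def keywords(tokens1, tokens2, top=5):
    """Return the `top` (score, word) pairs with the highest G2 (hypothetical helper)."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    vocab = set(c1) | set(c2)
    scored = [(log_likelihood(c1[w], c2[w], n1, n2), w) for w in vocab]
    return sorted(scored, reverse=True)[:top]

# Toy stand-ins for two nonparallel corpora (invented data).
british = "she put the nappy in the bag and closed the bonnet".split()
american = "she put the diaper in the bag and closed the hood".split()

for score, word in keywords(british, american, top=4):
    print(f"{word:8s} {score:.3f}")
```

<p class=MsoPlainText><span style='color:black'>On this toy data the four dialect-specific words float to the top, but nothing in the statistic says that <i>nappy</i> goes with <i>diaper</i> rather than with <i>hood</i>. Establishing that pairing is the part that alignment in a parallel corpus gives you.<o:p></o:p></span></p>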
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>
<p class=MsoPlainText><span style='color:black'>Mark, if you can figure out a way
to combine the quality and quantity of data from a very large corpus with the alignment
and equivalence power of a parallel corpus without actually having a parallel
corpus, I will personally nominate you for the Nobel Prize in Corpus
Linguistics. :-)<o:p></o:p></span></p>
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>
<p class=MsoPlainText><span style='color:black'>Merle<o:p></o:p></span></p>
<p class=MsoPlainText><span style='color:black'><o:p> </o:p></span></p>
<p class=MsoPlainText><span style='color:black'>PS and Shameless Microsoft Plug:
In the last paragraph, I accidentally typed “figure out a why to combine”
and I got the blue squiggle from Word 2007, which was released to manufacturing
on Monday of this week. It suggested <i>way</i>, and of course I took the
suggestion. I am amazed at the number of mistakes that the contextual speller
has caught in my writing since I started using it. I recommend the new version
of Word and Office for this feature alone. :-)<o:p></o:p></span></p>
</div>
</body>
</html>