[Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter
Ramesh Krishnamurthy
r.krishnamurthy at aston.ac.uk
Fri Nov 10 14:45:52 UTC 2006
Hi Merle
I must admit I hadn't been thinking of "parallel"
corpora along such strict-definition lines.
So who is creating large amounts of 'parallel'
data (in the technical/translation sense)
for British English and American English? I
wouldn't have thought there was a very large
market...?
Noah Smith mentioned Harry Potter, and I must
admit I'm quite surprised to discover
that publishers are making such changes as
> They had drawn for the house cup
> They had tied for the house cup
Perhaps because it's "children's" literature? Or
at least read by many children,
who may not be willing/able to cross varietal boundaries with total comfort.
But when I read a novel by an American author, I
accept that it's part of my role as reader to
take on board any varietal differences as part of
the context. I can't imagine anyone wanting
to translate it into British English for my
benefit, and I suspect I would hate to read the resulting
text...
Best
Ramesh
At 18:53 09/11/2006, Merle Tenney wrote:
>Ramesh Krishnamurthy wrote:
> >
> > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> > Do you know of one by any chance...
> >
> > And Mark P. Line responded:
> >
> >Why would it have to be a *parallel* corpus?
>
>In a word, alignment. The formative work in
>parallel corpora has come from the machine
>translation crowd, especially the statistical
>machine translation researchers. The primary purpose of
>having a parallel corpus is to align
>translationally equivalent documents in two
>languages, first at the sentence level, then at
>the word and phrase level, in order to establish
>word and phrase equivalences. A secondary
>purpose, deriving from the sentence-level
>alignment, is to produce source and target
>sentence pairs to prime the pump for translation memory systems.
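
To make the word- and phrase-level alignment step concrete: below is a
minimal sketch of IBM-Model-1-style EM word alignment, run over invented
BrE/AmE sentence pairs. It is the textbook recipe in miniature, not any
particular SMT system, and the toy data is made up for illustration.

    from collections import defaultdict

    # Toy "parallel" BrE -> AmE sentence pairs, already sentence-aligned.
    pairs = [
        ("the baby wore a nappy".split(), "the baby wore a diaper".split()),
        ("they drew the match".split(), "they tied the match".split()),
        ("open the bonnet".split(), "open the hood".split()),
    ]

    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # t[(e, f)] approximates t(f|e); start uniform, refine with EM.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(10):                      # a few EM iterations
        count = defaultdict(float)           # expected counts c(e, f)
        total = defaultdict(float)           # marginal counts per e
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(e, f)] for e in src)
                for e in src:
                    frac = t[(e, f)] / norm  # E-step: fractional alignment
                    count[(e, f)] += frac
                    total[e] += frac
        for (e, f), c in count.items():      # M-step: re-estimate t(f|e)
            t[(e, f)] = c / total[e]

    # The highest-probability pairs surface as the word equivalences
    # described above, e.g. nappy~diaper, drew~tied, bonnet~hood.
    for (e, f), p in sorted(t.items(), key=lambda kv: -kv[1])[:8]:
        print(f"{e} -> {f}: {p:.2f}")
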
>
>Like you, I have wondered why you couldn't study
>two text corpora of similar but not equivalent
>texts and compare them in their totality. Of
>course you can, but is there any way in this
>scenario to come up with meaningful term-level
>comparisons, as good as you can get with
>parallel corpora? I can see two ways you might proceed:
>
>The first method largely begs the question of
>term equivalence. You begin with a set of known
>related words and you compare their frequencies
>and distributions. So if you are studying
>language models, you compare sheer, complete,
>and utter as a group. If you are studying
>dialect differences, you study diaper and nappy
>or bonnet and hood (clothing and
>automotive). If you are studying translation
>equivalence in English and Spanish, you study
>flag, banner, standard, pennant alongside
>bandera, estandarte, pabellón (and flag,
>flagstone vs. losa, lancha; flag, fail,
>languish, weaken vs. flaquear, debilitarse,
>languidecer; etc.). The point is, you already
>have your comparable sets going in, and you
>study their usage across a broad corpus. One
>problem here is that you need to have a strong
>word sense disambiguation component or you need
>to work with a word sense-tagged corpus to deal
>with homophonous and polysemous terms like
>sheer, bonnet, flat, and flag, so you still have
>some hard work left even if you start with the related word groups.
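
In code, this first method can be as simple as comparing per-million
frequencies of the predefined word sets across the two corpora. A
minimal sketch, with toy strings standing in for real corpora, and with
the sense-conflation caveat just mentioned flagged in the comments:

    import re
    from collections import Counter

    # Known related-word sets, fixed in advance as the method requires.
    WORD_SETS = {
        "diaper/nappy": ["diaper", "nappy"],
        "hood/bonnet": ["hood", "bonnet"],
    }

    def per_million(text, words):
        """Frequency per million tokens for each word in `words`."""
        tokens = re.findall(r"[a-z']+", text.lower())
        scale = 1_000_000 / max(len(tokens), 1)
        counts = Counter(tokens)
        return {w: round(counts[w] * scale, 1) for w in words}

    # Toy stand-ins; in practice, load large BrE/AmE corpora instead.
    br_text = "the pram and the nappy were under the car bonnet"
    am_text = "the stroller and the diaper were under the car hood"

    for label, words in WORD_SETS.items():
        print(label)
        print("  BrE:", per_million(br_text, words))
        print("  AmE:", per_million(am_text, words))
        # NB: raw counts conflate senses (bonnet the hat vs. the car
        # part), which is exactly why the WSD step above is needed.
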
>
>The second method does not begin, a priori, with
>sets of related words. In fact, generating
>synonyms, dialectal variants, and translation
>equivalents is one of its more interesting
>challenges. Detailed lexical, collocational,
>and syntactic characterization is
>another. Again, this is much easier to do if
>you are working with parallel corpora. If you
>are dealing with large, nonparallel texts, this
>is a real challenge. Beyond inflected and
>lemmatized word forms, there are a few more
>hooks that can be applied, including POS tagging
>and WSD. Even if both of these technologies
>perform well, however, that is still not enough
>to get you to the quality of data that you get with parallel corpora.
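
One standard way to attack the nonparallel case (my gloss, not a method
named above) is distributional similarity: words that occur in similar
contexts get similar context vectors, and candidate equivalents can be
ranked by cosine similarity. A self-contained toy sketch, with invented
mini-corpora standing in for large lemmatized, POS-tagged ones:

    import math
    from collections import Counter, defaultdict

    def context_vectors(tokens, window=2):
        """Bag-of-words context vector for each word type."""
        vecs = defaultdict(Counter)
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), i + window + 1
            for c in tokens[lo:i] + tokens[i + 1:hi]:
                vecs[w][c] += 1
        return vecs

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    # Invented mini-corpora; in practice these would be two large,
    # independent corpora, ideally lemmatized and POS-tagged as above.
    tokens_br = ("the baby wore a fresh nappy and "
                 "mum changed the nappy at noon").split()
    tokens_am = ("the baby wore a fresh diaper and "
                 "mom changed the diaper at noon").split()

    vecs_br = context_vectors(tokens_br)
    vecs_am = context_vectors(tokens_am)

    # Rank AmE words by context similarity to BrE "nappy"; the shared
    # vocabulary (the, baby, wore...) anchors the two vector spaces.
    ranked = sorted(vecs_am,
                    key=lambda w: cosine(vecs_br["nappy"], vecs_am[w]),
                    reverse=True)
    print(ranked[:5])   # with enough data, "diaper" should rank high
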
>
>Mark, if you can figure out a way to combine the
>quality and quantity of data from a very large
>corpus with the alignment and equivalence power
>of a parallel corpus without actually having a
>parallel corpus, I will personally nominate you
>for the Nobel Prize in Corpus Linguistics. :-)
>
>Merle
>
>PS and Shameless Microsoft Plug: In the last
>paragraph, I accidentally typed "figure out a
>why to combine" and I got the blue squiggle from
>Word 2007, which was released to manufacturing
>on Monday of this week. It suggested "way", and
>of course I took the suggestion. I am amazed at
>the number of mistakes that the contextual
>speller has caught in my writing since I started
>using it. I recommend the new version of Word
>and Office for this feature alone. :-)
Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages
and Social Sciences, Aston University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building] ; Tel:
+44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/