[Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

Alex Murzaku lissus at gmail.com
Fri Nov 10 15:26:36 UTC 2006


At AMTA 2006, Jaime Carbonell et al. presented a paper, "Context-Based Machine
Translation", describing an MT system that Fluent/Meaningful Machines has been
developing for the last few years. I happened to see it in action a few
years ago, when it was just a prototype, and it performed remarkably well.
The reason I am bringing it up in this forum is that the knowledge it uses
for the translation process is based not on parallel corpora but on very
large monolingual corpora. This approach not only provided the necessary
tools for the translation itself but also produced a slew of other tools that
are very useful for text mining etc.

Eli Abir (the original inventor) has filed several patents that cover this
suite of products:
http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=TR200402395T&F=0&QPN=TR200402395T
They describe solutions to exactly the kind of situation you are
wondering about.

All the best,

Alex Murzaku

On 11/10/06, Ramesh Krishnamurthy <r.krishnamurthy at aston.ac.uk> wrote:
>
> Hi Merle
> I must admit I hadn't been thinking of "parallel" corpora along such
> strict-definition lines.
>
> So who is creating large amounts of 'parallel' data (in the
> technical/translation sense) for British English and American English?
> I wouldn't have thought there was a very large market....?
>
> Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised to
> discover that publishers are making such changes as
>
>    They had drawn for the house cup
>    They had tied for the house cup
>
> Perhaps because it's "children's" literature? Or at least read by many
> children, who may not be willing/able to cross varietal boundaries with
> total comfort.
>
> But when I read a novel by an American author, I accept that it's part of
> my role as reader to take on board any varietal differences as part of the
> context. I can't imagine anyone wanting to translate it into British
> English for my benefit, and I suspect I would hate to read the resulting
> text...
>
> Best
> Ramesh
>
>
> At 18:53 09/11/2006, Merle Tenney wrote:
>
> Ramesh Krishnamurthy wrote:
> >
> > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
> > Do you know of one by any chance...
> >
> > And Mark P. Line responded:
> >
> >Why would it have to be a *parallel* corpus?
>
> In a word, alignment.  The formative work in parallel corpora has come
> from the machine translation crowd, especially the statistical machine
> translation researchers.  The primary purpose of having a parallel corpus
> is to align translationally equivalent documents in two languages, first at
> the sentence level, then at the word and phrase level, in order to
> establish word and phrase equivalences.  A secondary purpose, deriving from
> the sentence-level alignment, is to produce source and target sentence
> pairs to prime the pump for translation memory systems.
>
> Like you, I have wondered why you couldn't study two text corpora of
> similar but not equivalent texts and compare them in their totality.  Of
> course you can, but is there any way in this scenario to come up with
> meaningful term-level comparisons, as good as you can get with parallel
> corpora?  I can see two ways you might proceed:
>
> The first method largely begs the question of term equivalence.  You begin
> with a set of known related words and you compare their frequencies and
> distributions.  So if you are studying language models, you compare *sheer*,
> *complete*, and *utter* as a group.  If you are studying dialect
> differences, you study *diaper* and *nappy* or *bonnet* and *hood*
> (clothing and automotive).  If you are studying translation equivalence in
> English and Spanish, you study *flag*, *banner*, *standard*, *pendant*
> alongside *bandera*, *estandarte*, *pabellón* (and *flag*, *flagstone* vs.
> *losa*, *lancha*; *flag*, *fail*, *languish*, *weaken* vs. *flaquear*,
> *debilitarse*, *languidecer*; etc.).  The point is, you already have your
> comparable sets going in, and you study their usage across a broad corpus.
> One problem here is that you need to have a strong word sense disambiguation
> component or you need to work with a word sense-tagged corpus to deal with
> homophonous and polysemous terms like *sheer*, *bonnet*, *flat*, and *flag*,
> so you still have some hard work left even if you start with the related
> word groups.
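
A minimal sketch of the first method described above: start from a known set of
related words and compare their relative frequencies in two non-parallel
corpora. The function names and the two toy "corpora" are hypothetical and
invented purely for illustration; a real study would read large corpus files,
and this sketch does no word sense disambiguation, so the caveat about
polysemous terms such as *bonnet* still applies.

```python
import re
from collections import Counter


def relative_frequencies(text, terms, per=1_000_000):
    """Return occurrences of each term per `per` tokens of `text`."""
    # Crude tokenizer (lowercased runs of ASCII letters); an assumption,
    # not a recommendation for real corpus work.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (counts[t] * per / total if total else 0.0) for t in terms}


def compare_term_set(corpus_a, corpus_b, terms):
    """Pair up the normalized frequency of each term in the two corpora."""
    freq_a = relative_frequencies(corpus_a, terms)
    freq_b = relative_frequencies(corpus_b, terms)
    return {t: (freq_a[t], freq_b[t]) for t in terms}


# Toy stand-ins for a British and an American corpus (invented sentences).
british = "She changed the nappy and then opened the bonnet of the car."
american = "She changed the diaper and then opened the hood of the car."

# The related-word set is supplied up front, as in the first method.
table = compare_term_set(british, american, ["nappy", "diaper", "bonnet", "hood"])
for term, (br, am) in table.items():
    # No sense tagging here: "bonnet" (clothing) and "bonnet" (car) are conflated.
    print(f"{term:8s} BrE {br:10.1f}  AmE {am:10.1f}")
```

Normalizing per million tokens lets corpora of different sizes be compared
directly; raw counts would not.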
>
> The second method does not begin, a priori, with sets of related words.
> In fact, generating synonyms, dialectal variants, and translation
> equivalents is one of its more interesting challenges.  Detailed lexical,
> collocational, and syntactic characterization is another.  Again, this is
> much easier to do if you are working with parallel corpora.  If you are
> dealing with large, nonparallel texts, this is a real challenge.  Other than
> inflected and lemmatized word forms, there are a few more hooks that can be
> applied, including POS tagging and WSD.  Even if both of these technologies
> perform well, however, that is still not enough to get you to the quality of
> data that you get with parallel corpora.
>
> Mark, if you can figure out a way to combine the quality and quantity of
> data from a very large corpus with the alignment and equivalence power of a
> parallel corpus without actually having a parallel corpus, I will personally
> nominate you for the Nobel Prize in Corpus Linguistics.  :-)
>
> Merle
>
> PS and Shameless Microsoft Plug:  In the last paragraph, I accidentally
> typed "figure out a why to combine" and I got the blue squiggle from Word
> 2007, which was released to manufacturing on Monday of this week.  It
> suggested *way*, and of course I took the suggestion.  I am amazed at the
> number of mistakes that the contextual speller has caught in my writing
> since I started using it.  I recommend the new version of Word and Office
> for this feature alone.  :-)
>
> Ramesh Krishnamurthy
>
> Lecturer in English Studies, School of Languages and Social Sciences,
> Aston University, Birmingham B4 7ET, UK
> [Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ; Fax:
> +44 (0)121-204-3766
> http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
>
> Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/
>