PD: [Corpora-List] Date: Wed, 11 Sep 2002 15:16:20 +0200

Rafał Górski RafalG at ijp-pan.krakow.pl
Thu Sep 12 12:55:27 UTC 2002


Dear Maria and Joerg,
In fact there is a lot of confusion in the terminology. Joerg writes:

>* a translation corpus should contain the original version and at least
>one translation (but not necessarily only one)

On the other hand, McEnery & Wilson, "Corpus Linguistics" (2nd edition, 2001), p.
70:
"translation corpora differ from parallel corpora, as they do not represent
text in translation. Rather they allow one to compare, for example, L1
French texts in one genre with L1 English texts in the same genre." The
authors treat "translation" and "comparable" as synonyms (however, they give
preference to the former, using it in the body of the text; the term
"comparable" is given only in a footnote).

Sinclair: "A comparable corpus is one which selects similar texts in more
than one language or variety."  EAGLES Preliminary recommendations on Corpus
Typology. Version of May, 1996
http://www.ilc.pi.cnr.it/EAGLES96/corpustyp/corpustyp.html
Note, however, that Sinclair calls the International Corpus of English a
"comparable corpus". In this case you cannot treat "comparable" and
"translation" as equivalents!

> * parallel corpora should be aligned to some extent to make them
>   searchable within linked segments, alignment can be done e.g. on
>   paragraphs or sentences (translation corpora do not have to be aligned I
>   would say)
John Sinclair in: EAGLES Preliminary recommendations... defines:
"A parallel corpus is a collection of texts, each of which is translated
into one or more other languages than the original."
McEnery & Wilson (2001) and Sinclair suggest that parallel corpora are not
necessarily aligned, although they admit that a parallel corpus with no
alignment is a bit strange (see section 2.3.1.)

I admit that the term "translation corpus" is confusing: you would rather
understand it as a "corpus of translations" than "corpus for translators" or
"used mainly by translators" (which is the right interpretation).

Rafal L. Górski

----- Original Message -----
From: Jörg Tiedemann <joerg at stp.ling.uu.se>
To: <maria_rzewuska at mail.ukie.gov.pl>
Cc: <corpora at hd.uib.no>
Sent: Wednesday, September 11, 2002 6:15 PM
Subject: Re: [Corpora-List] Date: Wed, 11 Sep 2002 15:16:20 +0200


>
>
> I don't know of any single article which summarises the terminology with
> regard to parallel corpora, but from my experience some of the
> differences are the following:
>
> * bilingual corpora contain strictly two languages
> * a parallel corpus contains translations of a common source but they do
>   not need to include the original version (even if this sounds strange -
>   I know of parallel corpora e.g. from the EU which do not indicate the
>   original version and I used to work with some of them without
>   knowing/using the original or intermediate documents)
> * parallel corpora should be aligned to some extent to make them
>   searchable within linked segments, alignment can be done e.g. on
>   paragraphs or sentences (translation corpora do not have to be aligned I
>   would say)
> * comparable corpora are two or more corpora of similar size and from
>   similar domains. Usually people assume a similar distribution of
>   words/phrases in comparable corpora in order to compare them. They do
>   not have to be parallel (or translations of each other)
> * comparable and parallel corpora do not have to include multiple
>   languages whereas translation corpora should
> * sometimes I use another term for bilingual parallel corpora: bitexts -
>   just to make it shorter. In this case, aligned segments within such
>   corpora will be bitext segments
>
>
> I hope this helped a bit and did not create even more confusion,
>
>
> best regards,
>
>
>
> Jörg
>
> ***********/\/\/\/\/\/\/\/\/\/\/\************************************
> **  Joerg Tiedemann                 joerg at stp.ling.uu.se           **
> **  Department of Linguistics    http://stp.ling.uu.se/~joerg/     **
> **  Uppsala University               tel: (018) 471 7007           **
> **  S-751 20 Uppsala/SWEDEN          fax: (018) 471 1416           **
> *************************************/\/\/\/\/\/\/\/\/\/\/\**********
>
>
>
>
> On Wed, 11 Sep 2002 maria_rzewuska at mail.ukie.gov.pl wrote:
>
> > Hi, I have been reading the list for a while and lately I took a closer
> > look at some bilingual corpus projects and I noticed a relatively flexible
> > use of terms: translation corpus, parallel corpus, comparable corpus, but
> > mainly of the first two. Maybe someone could tell me whether there is any
> > difference or whether they are simply mixed up. In the composition of the
> > corpora I did not find any difference which could explain the
> > terminological difference. Any book or clever article that I should read?
> > thanks
> >
> > Maria Rzewuska
> > Adam Mickiewicz University
> > Poznan
> > PL
> >
> >
>
>
>
>


