[Corpora-List] Date: Wed, 11 Sep 2002 15:16:20 +0200

Jörg Tiedemann joerg at stp.ling.uu.se
Wed Sep 11 16:15:13 UTC 2002


I don't know of any single article that summarises the terminology for
parallel corpora, but in my experience some of the differences are the
following:

* bilingual corpora contain exactly two languages
* a translation corpus should contain the original version and at least
  one translation (but not necessarily only one)
* a parallel corpus contains translations of a common source, but it does
  not need to include the original version (even if this sounds strange:
  I know of parallel corpora, e.g. from the EU, which do not indicate the
  original version, and I used to work with some of them without
  knowing/using the original or intermediate documents)
* parallel corpora should be aligned to some extent so that they can be
  searched via linked segments. Alignment can be done e.g. on paragraphs
  or sentences (translation corpora do not have to be aligned, I would
  say)
* comparable corpora are two or more corpora of similar size and from
  similar domains. Usually people assume a similar distribution of
  words/phrases in comparable corpora in order to compare them. They do
  not have to be parallel (or translations of each other)
* comparable and parallel corpora do not have to include multiple
  languages whereas translation corpora should
* sometimes I use another term for bilingual parallel corpora: bitexts,
  just to make it shorter. In this case, aligned segments within such
  corpora are bitext segments
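
To make the last points a bit more concrete, here is a minimal Python
sketch of a bitext as a list of aligned segments that can be searched on
either side. All names (BitextSegment, Bitext, search) and the
English/Swedish sample sentences are made up for illustration and do not
come from any particular tool. The two sides are called "a" and "b"
rather than "source" and "target" because, as said above, a parallel
corpus does not have to indicate which version is the original.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BitextSegment:
    """One aligned unit (e.g. a sentence or paragraph) in two languages."""
    text_a: str
    text_b: str


@dataclass
class Bitext:
    """A bilingual parallel corpus: two languages plus aligned segments."""
    lang_a: str
    lang_b: str
    segments: List[BitextSegment] = field(default_factory=list)

    def search(self, term: str) -> List[BitextSegment]:
        """Return every segment where either side contains the term
        (case-insensitive substring match)."""
        needle = term.lower()
        return [
            seg for seg in self.segments
            if needle in seg.text_a.lower() or needle in seg.text_b.lower()
        ]


# Hypothetical sample data, aligned at sentence level.
demo = Bitext(
    lang_a="en",
    lang_b="sv",
    segments=[
        BitextSegment("The house is red.", "Huset är rött."),
        BitextSegment("I read a book.", "Jag läser en bok."),
    ],
)

for hit in demo.search("house"):
    print(hit.text_a, "<->", hit.text_b)
```

Because the segments are linked, a hit on one side immediately gives the
corresponding text on the other side, which is what the alignment is for.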


I hope this helped a bit and did not create even more confusion,


best regards,



Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Joerg Tiedemann                 joerg at stp.ling.uu.se           **
**  Department of Linguistics    http://stp.ling.uu.se/~joerg/     **
**  Uppsala University               tel: (018) 471 7007           **
**  S-751 20 Uppsala/SWEDEN          fax: (018) 471 1416           **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********




On Wed, 11 Sep 2002 maria_rzewuska at mail.ukie.gov.pl wrote:

> Hi, I have been reading the list for a while and lately I took a closer
> look at some bilingual corpus projects and I noticed a relatively flexible
> use of terms: translation corpus, parallel corpus, comparable corpus, but
> mainly between the two first. Maybe someone could tell me is there any
> difference or is it simply mixed up. In the composition of the corpora I
> did not find any difference which could explain the terminological
> difference. Any book or clever article that I should read?
> thanks
>
> Maria Rzewuska
> Adam Mickiewicz University
> Poznan
> PL
>
>


