[Corpora-List] Parallel texts for machine translation evaluation

D Elliott debe at comp.leeds.ac.uk
Wed May 21 10:04:27 UTC 2003


Dear all

I am collecting parallel texts for a corpus designed specifically for MT
evaluation (to be made available online for research) and would appreciate
any advice on where to find parallel texts of a particular kind:

Source texts/extracts of approx. 400 words each in:
French, Italian, German, Spanish, Chinese (Simplified and/or Traditional),
Japanese, Russian and Portuguese.

The challenge is that these must have very high-quality human English
translations which can be used as a 'gold standard' against which we
can compare MT output (NB: British English if possible). I am just
beginning to realise how difficult a task I have set myself! Another
problem is that some multilingual sites are localised to such an extent
that parts have been rewritten rather than translated - doh!

The kinds of texts in the corpus will represent current MT use. The
following (provisional) categories have been selected, following a
worldwide survey of MT users:

Technical documents (e.g. software user manuals, online help, telecoms,
automotive, aerospace)
Correspondence (letters/emails)
Academic papers
Tourist/travel information
Newspaper articles
Medical documents
Scientific documents
Financial documents (stock exchange reports, banking, insurance)
Legal documents (including patents)
Calls for tender
Internal company documents (e.g. minutes, training material, company
reports)

Any URLs or other sources (even on paper!) would be gratefully received.
Sources which do not require copyright permission would also be a big
time-saver. All sources will obviously be acknowledged in the corpus.

I will post a summary of feedback as soon as the deluge stops (wishful
thinking!)

Debbie Elliott

For more information on the project so far, see:
Elliott, Debbie; Hartley, Anthony; Atwell, Eric. Rationale for a
multilingual corpus for machine translation evaluation. In: Archer, D.,
Rayson, P., Wilson, A. & McEnery, T. (editors), Proceedings of CL2003:
International Conference on Corpus Linguistics, pp. 191-200. Lancaster
University, 2003.



***************************************************
Debbie Elliott
Computer Vision and Language Research Group,
School of Computing,
University of Leeds,
Leeds LS2 9JT
United Kingdom.
Tel: 0113 3436818
Email: debe at comp.leeds.ac.uk
***************************************************