14.1461, Qs: Data Collection; Machine Translation Texts

Wed May 21 19:52:28 UTC 2003

LINGUIST List:  Vol-14-1461. Wed May 21 2003. ISSN: 1068-4875.

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Fox <fox at linguistlist.org>
==========================================================================
We'd like to remind readers that responses to queries are usually
best sent directly to the individual asking the question. That
individual is then strongly encouraged to post a summary to the list.
This policy was instituted to help control the huge volume of mail on
LINGUIST, so we would appreciate your cooperation whenever it seems
appropriate.

In addition to posting a summary, we'd like to remind people that it
is usually a good idea to personally thank those individuals who have
taken the trouble to respond to the query.

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

=================================Directory=================================

1)
Date:  Tue, 20 May 2003 08:25:07 +0000
From:  Hale Isik <halei at fedu.metu.edu.tr>
Subject:  Data Collection Assistance

2)
Date:  Wed, 21 May 2003 13:33:07 +0100 (BST)
From:  D Elliott <debe at comp.leeds.ac.uk>
Subject:  Parallel texts for machine translation evaluation

-------------------------------- Message 1 -------------------------------

Date:  Tue, 20 May 2003 08:25:07 +0000
From:  Hale Isik <halei at fedu.metu.edu.tr>
Subject:  Data Collection Assistance

Dear Members,

I am a graduate student and also work as a research assistant at the
Dept of FLE, METU, Turkey. I am currently writing my M.A. thesis, which
aims to explore, in the most general sense, the impact of
self-guiding principles and cultural values on communication and how
these are operationalized in the language we use in situationally
defined contexts in Turkish and English.

To carry out a cross-cultural analysis (with Turkish data), I need to
use data compiled from native speakers of English who are university
students and citizens of the UK or USA.

I would be overwhelmingly grateful if you could direct such students
enrolled at your university/department/course to fill out my online
questionnaire, which can be accessed via:

http://www.fedu.metu.edu.tr/hale/questionnaire_english.asp

Please accept my sincere thanks in advance for your support.

Hale Isik

Research Assistant
Department of Foreign Language Education
Middle East Technical University, Ankara, Turkey
E-mail: hisik at metu.edu.tr
        hale at tutor.fedu.metu.edu.tr

Subject-Language: English; Code: ENG

-------------------------------- Message 2 -------------------------------

Date:  Wed, 21 May 2003 13:33:07 +0100 (BST)
From:  D Elliott <debe at comp.leeds.ac.uk>
Subject:  Parallel texts for machine translation evaluation

Dear all

I am collecting parallel texts for a corpus designed specifically for
MT evaluation (to be made available online for research) and would
appreciate any advice on where to find parallel texts of a particular
kind:

Source texts/extracts of approx. 400 words each in: French, Italian,
German, Spanish, Chinese (Simplified and/or Traditional), Japanese,
Russian and Portuguese.

The challenge is that these must have very good quality human English
translations that can serve as a 'gold standard' against which we
can compare MT output (NB: British English if possible). I am just
beginning to realise how difficult a task I have set myself! (Another
problem is that some multilingual sites are localised to such an
extent that parts have been rewritten rather than translated - doh!)

The kinds of texts in the corpus will represent current MT use. These
(provisional) categories were selected following a worldwide survey
of MT users:

Technical documents (e.g. software user manuals, online help, telecoms,
automotive, aerospace)
Correspondence (letters/emails)
Academic papers
Tourist/travel information
Newspaper articles
Medical documents
Scientific documents
Financial documents (stock exchange reports, banking, insurance)
Legal documents (including patents)
Calls for tender
Internal company documents (e.g. minutes, training material, company
reports)

Any URLs or other sources (even on paper!) would be gratefully
received.  Sources which do not require copyright permission would
also be a big time-saver. All sources will obviously be acknowledged
in the corpus.

I will post a summary of feedback as soon as the deluge stops (wishful
thinking!)

Debbie Elliott

For more information on the project so far, see:

Elliott, Debbie; Hartley, Anthony; Atwell, Eric. Rationale for a
multilingual corpus for machine translation evaluation. In: Archer, D.,
Rayson, P., Wilson, A. & McEnery, T. (eds.), Proceedings of CL2003:
International Conference on Corpus Linguistics, pp. 191-200. Lancaster
University, 2003.

***************************************************
Debbie Elliott
Computer Vision and Language Research Group,
School of Computing,
University of Leeds,
Leeds LS2 9JT
United Kingdom.
Email: debe at comp.leeds.ac.uk
***************************************************
