ELL: re: Machine translation for Akha

Mon Jul 19 13:45:35 UTC 1999

Date: Mon, 19 Jul 1999 15:45:35 +0200
To: endangered-languages-l at carmen.murdoch.edu.au
From: Jeff ALLEN <jeff at elda.fr>

At 14:30 15/07/99 +0200, Trond Trosterud wrote:
>In his comment upon my posting on MT for Akha, Jeff reveals that I know more
>about morphology than about MT, which is certainly true.

I was not necessarily trying to say that, but rather to point out that
different types of MT systems produce output according to different levels
of processing.

>When it comes to choosing between a quick and bad or a long-term and good
>MT, the answer may be dependent upon what lgs we are dealing with.

Yes, this can be a factor. Some important questions to consider are:

1) type of translation needed
2) type of user
3) type of documentation

Take a look at the white papers available at
www.languagepartners.com
on the topic of choosing CAT and MT tools. They are well written, not too
technical, and very informative.

>If I
>know a lot on a subject (say min lg issues) and want to get an overview
>over japanese research on this issue in Siberia (say), then I want #any# MT
>system for Japanese now, right away, and I will accept almost anything.

This is called the "getting the gist" inbound translation approach.

>But
>if I, like the Swedish truck company Scania, want to translate user manuals
>into several lgs, obviously the standards are higher.

This is called the "high quality" outbound translation approach.

Read more on the Language Partners International (LPI) site about this.

I have written up supplementary information along these lines, and can be
contacted off-list for it once you have read through the
introductory material available at the LPI site.

>The Example-Based MT system that Jeff refers to seems to work better the
>more examples there are to base it upon.

Yes and no. At Carnegie Mellon University, we studied the number of terms
and sentences needed to develop a good system. One of our more
comprehensive papers on the topic is:

ALLEN, Jeffrey and Christopher HOGAN. 1998.  Expanding lexical coverage of
parallel corpora for the Example-Based Machine Translation approach. In
Proceedings of the First International Conference on Language Resources and
Evaluation, 28-30 May 1998, Granada, Spain. Vol. 2, pp. 747-754.

It is technical enough to keep researchers interested, and its descriptive
prose helps non-technical readers understand the purpose of the system and
the evidence of how it works.

>One way to get such material is to gather parallel corpora (PC). PCs are
>when e.g. a novel and its translations are aligned, sentence-by-sentence.
>As you can imagine, sentence 19865 in the original does not correspond to
>sentence 19865 in the translation, so things must be done.

I'm currently editing several articles on Translation Memory, to appear in
upcoming issues of the ELRA Newsletter, that discuss new enhancements in
Translation Memory applications. Subsentence-level analysis and processing are
increasing the accuracy of the search/replace sequences and making the
work more productive for users.
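
As a rough illustration of what a Translation Memory lookup does, the
following sketch finds the stored segment closest to a new sentence. It
compares word tokens with Python's standard difflib module, which is only a
crude stand-in for the subsentence-level matching mentioned above. The
function name, the 0.7 threshold, and the sample segments are assumptions
made for the example, not features of any particular product.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

def best_match(
    query: str,
    memory: List[Tuple[str, str]],
    threshold: float = 0.7,
) -> Optional[Tuple[float, str, str]]:
    """Return (score, stored source, stored target) for the closest segment.

    Segments are compared as lists of lower-cased word tokens. None is
    returned when no stored segment reaches the threshold.
    """
    best = None
    query_tokens = query.lower().split()
    for source, target in memory:
        score = SequenceMatcher(None, query_tokens, source.lower().split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

# Hypothetical memory; the targets are placeholders, not real translations.
memory = [
    ("open the file menu", "TARGET-1"),
    ("close the window", "TARGET-2"),
]
hit = best_match("open the edit menu", memory)
```

A translator would then edit the returned target rather than translate from
scratch. Lowering the threshold returns more, but weaker, suggestions.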

>There exists software for that:
>
>http://www.hf.uio.no/iba/prosjekt/
>
>When the texts are aligned, the result can be used to extract translation
>equivalents. Handy for terminology work (software is underway that extracts
>not only new terms but their translations, in a very sophisticated way).

Yes, it would be very helpful for standardizing and cleaning up vocabulary and
terminology databases.
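
For readers wondering how aligned texts yield translation equivalents, here
is a minimal sketch that scores word pairs by how often they co-occur in
aligned sentence pairs, using the Dice coefficient. This is one simple,
well-known association measure and is not claimed to be what the software
mentioned above does. The function name and the toy data are assumptions for
illustration only.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

def dice_equivalents(pairs: List[Tuple[str, str]]) -> Dict[str, Tuple[str, float]]:
    """Map each source word to its best-scoring target word and Dice score.

    Dice(s, t) = 2 * cooccurrences(s, t) / (occurrences(s) + occurrences(t)),
    counted over aligned sentence pairs.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    best: Dict[str, Tuple[str, float]] = {}
    for (s, t), c in joint.items():
        score = 2 * c / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best

# Toy aligned corpus with placeholder tokens instead of real words.
toy = [("a b", "x y"), ("a c", "x z"), ("b c", "y z")]
table = dice_equivalents(toy)
```

On real data the candidate pairs would still need review by a terminologist,
since frequent function words and multi-word terms confuse a measure this
simple.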

>Whether it helps for EBMT as well I do not know, but the tone in Jeff's
>posting was quite optimistic.
>
>The morphological ground work that I described in my last posting is of
>course not that urgent when morphology is marginal or not existing (some
>m-processes usually exist, though).
>
>Anyway, the Sámi lgs do not have any MT systems, and although they
>obviously will need them to fulfil future bilingual legislations, there are
>more than enough undone tasks before that. Transferred to an Akha setting:
>
>A large, good dictionary Thai-Akha
>Consensus on Akha orthography, or at least knowledge of competing systems
>Conversion routines for changing text from one orthography into another
>Spel chackers (sic!) and hyphenation programs
>Standardised keyboard layout and encoding (for orth. outside ascii)
>html editing tools (for orth. outside ascii)

Yes, yes, and yes and yes again.

Thanks for the discussion on this topic.

Best,

Jeff
=================================================
Jeff ALLEN - Technical Manager/Directeur Technique
European Language Resources Association (ELRA)  &
European Language resources - Distribution Agency (ELDA)
(Agence Europe'enne de Distribution des Ressources Linguistiques)
55, rue Brillat-Savarin
75013   Paris   FRANCE
Tel: (+33) 1.43.13.33.33 - Fax: (+33) 1.43.13.33.30
mailto:jeff at elda.fr
http://www.icp.grenet.fr/ELRA/home.html