ELL: Min lgs, lg technology and cost, incl. a success story on Southern Saami

Trond Trosterud Trond.Trosterud at hum.uit.no
Thu Jul 15 08:17:14 UTC 1999




Here comes the report I announced in my previous posting: a late addition
to the discussion of cheap and expensive lg technologies we had on this
list some months ago, and at the same time a report on what we have done in
Scandinavia. It is a nice little success story, so read on.




Different amounts of money have been put forward as price tags on lg
development technology. As several people have said, different technologies
have different price tags. I still think that it is a mistake to assume
that lg technology for additional lgs has to cost as much as the original
technology, for the following two reasons:

1
Much of the original cost lies in developing the technology. Once it is
there, large parts of it may be carried over to new languages. To take a
straightforward example, hyphenation programs may consist of two parts: a
lg dependent one, detecting possible boundary points in the word forms, and
a lg independent one, adding boundary-independent hyphenation points (e.g.,
when hyphenating, make sure one consonant is carried over to the next line)
and making the word processor handle these points correctly.
Furthermore, since hyphenation is usually possible either at syllable
boundaries or between morphological formatives, and since syllable
structures across lgs have much in common, it is possible to apply not only
the programming technique but also large parts of the lg dependent part to
several lgs. I want to stress that I do not claim that new lgs do not need
#any# work (so please do not write comments like "yes!!!, additional lgs
#do# need much work"); I just stress that new lg versions can use what is
already done.
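
A minimal sketch of this two-part division, in Python. It is only an
illustration under stated assumptions: the vowel inventory stands in for
the lg dependent data, the function names (nuclei, break_points, hyphenate)
are hypothetical, and real hyphenators handle diphthongs, morphological
boundaries and exceptions that are ignored here.

```python
import re

def nuclei(word, vowels):
    """Lg dependent part: locate syllable nuclei as (start, end) spans.
    Assumption: a nucleus is simply a run of letters from `vowels`."""
    return [m.span() for m in re.finditer(f"[{vowels}]+", word)]

def break_points(word, spans, min_left=2, min_right=2):
    """Lg independent part: between two nuclei, break so that exactly one
    consonant is carried over to the next line, and never leave a fragment
    shorter than min_left / min_right letters."""
    points = []
    for (_, _prev_end), (next_start, _) in zip(spans, spans[1:]):
        point = next_start - 1  # index of the last consonant before the nucleus
        if min_left <= point <= len(word) - min_right:
            points.append(point)
    return points

def hyphenate(word, vowels="aeiouy"):
    """Combine both parts and return the word with all break points marked."""
    points = break_points(word, nuclei(word, vowels))
    pieces = [word[i:j] for i, j in zip([0] + points, points + [len(word)])]
    return "-".join(pieces)
```

Under these assumptions, moving to a new lg means supplying a new vowel
inventory (and, in a real system, a better nuclei function), while
break_points stays untouched.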

2
We see a paradigm shift within lg technology, from statistically based
to linguistically based solutions. This is particularly evident within the
field of morphology. Since English has had both the smallest amount of
morphology and the largest amount of development money available, it should
come as no surprise that many of the working technologies so far have
ignored morphology, and have otherwise relied on statistics. This
approach requires huge corpora, both written and (for speech
technology) transcribed spoken. "Huge" means several tens (today also
hundreds) of millions of words, much more than has ever been published in
the vast majority of the world's lgs. Using linguistically based technology
instead, it will become possible to get more out of less.



As an illustration, I would like to report on a cooperation project carried
out between the University of Tromsø (http://www.uit.no), or rather its
Faculty of Humanities (http://www.hum.uit.no/), and the Finnish software
company Lingsoft (http://www.lingsoft.fi). Our deal was this: Lingsoft
provides access to their technology. The university carries out the
linguistic analysis and writes appropriate rules following the directions
of Lingsoft. Lingsoft compiles the rules into functioning programs. The
university possesses all rights to use the resulting product for research
purposes, but the source code remains with Lingsoft. Development of and
possible profit sharing from commercial products are subject to further
negotiations. In the case of Southern Sámi, with perhaps 500 speakers, no
large market exists, but speaking for the University of Tromsø, our
interest is both to develop tools for our lg research and to see working
solutions for end users. I cannot promise you a similar deal: Sámi is a
domestic language in Finland, and also otherwise close to the home market
of Lingsoft; in other cases they have had their academic cooperation
partners pay for the license. (You can also do it yourself: go directly
to the PC or Mac version of the program, PC-KIMMO, which is available at
http://www.sil.org/pckimmo/ and is accompanied by a book.)
Lingsoft is interested in seeing their software applied to as many languages
as possible, and in my opinion their technology is far better than
competing statistical approaches (they are now launching the first #working#
spell-checker solution for Scandinavian languages; it will be part of
MS Office 2000).

What have we done?
In cooperation with Sjur Moshagen from Lingsoft I have developed a parser
for Southern Sámi nouns, based upon Kimmo Koskenniemi's Two-Level Morphology
(the technology is well known on the market; it is published in his thesis,
Koskenniemi 1983, and a PC/Mac version of the program is available, as
already mentioned). The total amount of work that we spent was perhaps one
month, most of it done by me (I mention this as a contribution to the
discussion on the cost of software for min lgs). The result is a prototype,
handling 4000 non-compound, non-derived nouns, inflecting and recognising
them in 7 cases and 2 numbers.

Southern Sámi is not that easy: it has some rather complicated umlaut
patterns (7 vowels / diphthongs may alternate in 6 different ways), the
shape of the affixes varies according to the number of syllables (even/odd),
and there are complicated stem alternations. What made the project possible
was that Southern Sámi possesses a resource that many small lgs have: a
good reference grammar and a dictionary. We did not have hundreds of
millions of words, but we had a good reference grammar and a dictionary. A
further advantage was that Koskenniemi's model was built for Finnish, also
a lg with a morphology rich enough to make listing of word forms impossible.
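
To make the idea of "inflecting and recognising" concrete, here is a toy
sketch in Python of how a stem list (the dictionary) and a suffix table (the
reference grammar) combine. It is not Koskenniemi's two-level formalism and
not PC-KIMMO input: it is a plain generate-and-look-up table. All stems and
suffixes are invented for illustration and are NOT Southern Sámi; the only
thing taken from the description above is that the suffix shape depends on
whether the stem has an even or odd number of syllables. Umlaut and stem
alternations are left out entirely.

```python
import re

VOWELS = "aeiou"  # assumption: toy vowel inventory

# Invented suffix table: (case, number) -> (suffix after an even-syllable
# stem, suffix after an odd-syllable stem). Not real Southern Sámi data.
SUFFIXES = {
    ("nom", "sg"): ("", ""),
    ("nom", "pl"): ("t", "at"),
    ("gen", "sg"): ("n", "en"),
    ("gen", "pl"): ("j", "ij"),
}

def syllable_count(stem):
    """Count syllables as runs of vowel letters (a toy approximation)."""
    return len(re.findall(f"[{VOWELS}]+", stem))

def inflect(stem, case, number):
    """Generate one word form, choosing the suffix by syllable parity."""
    even, odd = SUFFIXES[(case, number)]
    return stem + (even if syllable_count(stem) % 2 == 0 else odd)

def build_recogniser(stems):
    """Generate every form of every stem and index the analyses by form."""
    table = {}
    for stem in stems:
        for case, number in SUFFIXES:
            form = inflect(stem, case, number)
            table.setdefault(form, []).append((stem, case, number))
    return table
```

With the invented stems "kola" (two syllables) and "kolamas" (three),
build_recogniser(["kola", "kolamas"]) maps each generated form back to its
stem, case and number. A real two-level system does the same job with
finite-state rules instead of an exhaustive table, which is what makes
morphologically rich lgs tractable.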

The work on Southern Sámi does not stop here, of course. But already now,
the basic technology is in place for developing pedagogical programs
(paradigm drill, etc.). To cover the whole lg, one must do compounds and
productive derivation, then the adjectives (not that hard), the verbs
(harder) and the closed parts of speech as well. Then the result must be
tested and neologisms and geographical names added, but a working spell
checker is within reach, as are products related to it: information
retrieval via lexeme search rather than string search, automatic
registration of neologisms (important for terminological work), more
sophisticated pedagogical programs, etc.
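
The difference between lexeme search and string search can be sketched in a
few lines of Python. The ANALYSES table below is a hypothetical stand-in for
the output of a morphological analyser (English toy data, chosen only so the
example is readable); all names are assumptions made for this illustration.

```python
# Hypothetical analyser output: word form -> set of lexemes (lemmas).
ANALYSES = {
    "mouse": {"mouse"}, "mice": {"mouse"},
    "go": {"go"}, "goes": {"go"}, "went": {"go"},
}

def analyse(form):
    """Return the lexemes of a form; unknown forms count as their own lexeme."""
    return ANALYSES.get(form.lower(), {form.lower()})

def string_search(query, documents):
    """Plain substring search: finds only the literal character string."""
    return [doc_id for doc_id, text in documents.items()
            if query in text.lower()]

def lexeme_search(lexeme, documents):
    """Find documents containing any inflected form of the lexeme."""
    return [doc_id for doc_id, text in documents.items()
            if any(lexeme in analyse(token) for token in text.split())]

DOCS = {1: "The mice went home", 2: "A mouse goes out", 3: "The cat sleeps"}
```

Here string_search("mouse", DOCS) finds only document 2, while
lexeme_search("mouse", DOCS) also finds document 1, because "mice" is
analysed as a form of "mouse". The richer the morphology, the larger this
gap becomes.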

The moral of this story is that when done on a firm linguistic basis, lg
technology is interesting to universities, simply because it offers tools
that are needed in the ordinary research (tagging, parsing, etc.). In
principle, we then have a sharing of labour:

- the basic technology is provided by ling software companies, partly by
computational ling depts, and the development work is paid for by the
markets of the large lgs.
- the basic lg specific work is done by the lg and ling depts of the
universities
- comprehensive lexicons, input beyond grammar, proofreading etc. are done
by native speakers

This contribution was not written to "prove" that lg technology is a quick
fix, that it does not require work. On the contrary, it takes time. My
point is that this does #not# mean that the resulting price tag makes it
impossible to develop lg technology for min lgs. The basic development work
is paid for by academic institutions and by the large markets, and its price
will drop. The really time-consuming work of expanding the lg specific
programs from a computerized version of often imperfect dictionaries and
reference grammars into robust, working solutions must be done by the
language community itself. And even though a min lg community may be both
small and poor, it is always rich enough to find the man-hours needed.

Sjur Moshagen and I wrote an article describing our work; contact me if you
are interested in seeing it.

Trond.

-------------------------------------------------------------------
Trond Trosterud                                     t +47 7764 4763
Finsk institutt, Det humanistiske fakultet          h +47 7767 3639
N-9037 Universitetet i Tromsø, Noreg                f +47 7764 4239
Trond.Trosterud at hum.uit.no  http://www2.isl.uit.no/trond/index.html
-------------------------------------------------------------------


----
Endangered-Languages-L Forum: endangered-languages-l at carmen.murdoch.edu.au
Web pages http://carmen.murdoch.edu.au/lists/endangered-languages-l/
Subscribe/unsubscribe and other commands: majordomo at carmen.murdoch.edu.au
----

==================================================================
Date: Thu, 15 Jul 1999 14:14:56 +0200
To: endangered-languages-l at carmen.murdoch.edu.au
From: Birger Winsa <birger.winsa at finska.su.se>
Subject: ELL: European minorities

The European Union has carried out studies on the territorial lesser-used
languages and minority groups of the EU. The reports are now accessible
through the Internet. The most recent work includes the minorities in
Sweden, Finland and Austria.

http://www.uoc.es/euromosaic/web/homean/index1.html

**************************************************************************
Birger Winsa
Department of Finnish <http://www.finska.su.se>
Stockholm University
S-106 91 Stockholm
Sweden
Fax +46-(0)8-158871
Tel +46-(0)8-162359
E-mail: birger.winsa at finska.su.se
Database of minority policy decisions:
<http://www.kiruna.se/~ddd/>
Project: Kulturgräns Norr
<http://www.umu.se/nordiska/KGN>
**************************************************************************



=========================================================================
Date: Thu, 15 Jul 1999 14:30:29 +0200
To: endangered-languages-l at carmen.murdoch.edu.au
From: Trond Trosterud <Trond.Trosterud at hum.uit.no>
Subject: Re: ELL: re: Machine translation for Akha

In his comment on my posting on MT for Akha, Jeff reveals that I know more
about morphology than about MT, which is certainly true.

When it comes to choosing between a quick and bad or a long-term and good
MT, the answer may depend upon what lgs we are dealing with. If I
know a lot about a subject (say min lg issues) and want to get an overview
of Japanese research on this issue in Siberia (say), then I want #any# MT
system for Japanese now, right away, and I will accept almost anything. But
if I, like the Swedish truck company Scania, want to translate user manuals
into several lgs, obviously the standards are higher.

The Example-Based MT system that Jeff refers to seems to work better the
more examples there are to base it upon.

One way to get such material is to gather parallel corpora (PC). A PC is
what you get when e.g. a novel and its translations are aligned, sentence
by sentence. As you can imagine, sentence 19865 in the original does not
necessarily correspond to sentence 19865 in the translation, so the
alignment has to be worked out.


There exists software for that:

http://www.hf.uio.no/iba/prosjekt/
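
To show why sentence numbers drift apart and how an aligner can cope, here
is a minimal length-based sketch in Python. It is not the software linked
above and not a published algorithm in full: it is a simplified dynamic
programme that assumes a sentence and its translation have roughly the same
number of characters. The function name align and the two penalty values
are assumptions chosen for illustration.

```python
def align(src, tgt, skip_penalty=10, merge_penalty=3):
    """Align two lists of sentences by character length.

    Returns a list of (source_sentences, target_sentences) pairs, where
    each side holds zero, one or two sentences.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # Allowed steps: (source sentences used, target sentences used, penalty).
    steps = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in steps:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_src - len_tgt) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest sequence of steps.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

When a translator splits one long sentence into two, the cheapest path
through the table pairs that one source sentence with both target
sentences, and the numbering of everything after it shifts by one. Real
aligners add a proper statistical length model and often lexical anchors.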

When the texts are aligned, the result can be used to extract translation
equivalents. Handy for terminology work (software is underway that extracts
not only new terms but their translations, in a very sophisticated way).
Whether it helps for EBMT as well I do not know, but the tone in Jeff's
posting was quite optimistic.
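
A deliberately naive sketch of extracting translation equivalents from
aligned sentence pairs, in Python. It scores word pairs by how consistently
they occur in the same aligned pairs (the Dice coefficient). This is an
assumption-laden illustration, not the sophisticated software mentioned
above; the function name and the threshold are hypothetical, and without
morphological analysis every inflected form counts as a separate word.

```python
from collections import Counter
from itertools import product

def translation_candidates(aligned_pairs, min_dice=0.9):
    """Return {(source_word, target_word): score} for strongly linked words.

    aligned_pairs is a list of (source_sentence, target_sentence) strings.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)   # in how many pairs each word occurs
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))
    scores = {}
    for (s, t), together in pair_freq.items():
        dice = 2 * together / (src_freq[s] + tgt_freq[t])
        if dice >= min_dice:
            scores[(s, t)] = dice
    return scores
```

On a handful of sentences this already separates content words from
function words, since a frequent word like an article co-occurs with
everything and so scores low against any single translation. On real data
the corpus has to be far larger, and this is where a morphological parser
helps, by counting lexemes instead of word forms.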

The morphological groundwork that I described in my last posting is of
course not that urgent when morphology is marginal or non-existent (some
m-processes usually exist, though).

Anyway, the Sámi lgs do not have any MT systems, and although they
obviously will need them to fulfil future bilingual legislation, there are
more than enough undone tasks before that. Transferred to an Akha setting:

- A large, good Thai-Akha dictionary
- Consensus on Akha orthography, or at least knowledge of competing systems
- Conversion routines for changing text from one orthography into another
- Spel chackers (sic!) and hyphenation programs
- Standardised keyboard layout and encoding (for orth. outside ASCII)
- HTML editing tools (for orth. outside ASCII)
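
Of the tasks in this list, the conversion routine is the smallest. A minimal
Python sketch, assuming the two orthographies differ only by a fixed table
of letter sequences: the mapping used in the note below is invented and is
NOT an actual Akha orthography, and make_converter is a hypothetical name.

```python
import re

def make_converter(mapping):
    """Build a function that rewrites text from one orthography to another.

    Longer source sequences are tried first, so that a digraph wins over
    its first letter. The text is scanned once, so converted output is
    never converted a second time. Assumes a non-empty mapping.
    """
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)
```

For example, with the invented table {"sh": "š", "s": "z", "ng": "ŋ"}, the
converter turns "shang sa" into "šaŋ za". Where two orthographies make
different distinctions, a plain table is not enough and the conversion
needs a dictionary or a parser behind it.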

One good thing about m- and s-parsers (apart from being the base for good
MT systems) is that they make more intelligent spell checkers possible as
well (grammar checkers, reacting against gibberish syntax, etc.).

The bottom line is that reality is far from the technological doomsday
that we hear about in the media (only the Big have the resources). Reality
is that the Big ones can make the mistakes, and the small ones go in and
pick the good solutions.


Trond

-------------------------------------------------------------------
Trond Trosterud                                     t +47 7764 4763
Finsk institutt, Det humanistiske fakultet          h +47 7767 3639
N-9037 Universitetet i Tromsø, Noreg                f +47 7764 4239
Trond.Trosterud at hum.uit.no  http://www2.isl.uit.no/trond/index.html
-------------------------------------------------------------------

