[Lexicog] polysynthetic languages and dictionaries

Antti Arppe aarppe at LING.HELSINKI.FI
Wed May 26 10:16:04 UTC 2004


To those interested in morphological parsers,

Having worked myself with morphological parsers, I thought I would
cast my two cents into the discussion.

The languages I have been involved with, mainly the Nordic languages,
i.e. Finnish, Swedish, Norwegian and Danish, have in theory a very
large number of possible word forms, with English at the opposite
extreme for comparison.

In the case of Finnish, each verb can potentially have some 20,000
forms, each adjective some 6,000 and each noun some 2,000. Coupled
with derivation and compounding, this means that it is practically
impossible to reach any reasonable coverage with word lists, which has
been the lesson learnt by software companies that have tried to
provide wordlist-based spell-checkers for Finnish. Even though the
Scandinavian languages have considerably fewer inflected forms, they
compound much as Finnish does, which in practice also renders the word
list approach unfeasible for unrestricted running text.
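
To get a feel for the scale, here is a rough back-of-the-envelope
count. The total figures are those cited above; the breakdown into
cases, numbers, possessive suffixes and clitics is only an
illustrative assumption of mine:

```python
# Rough combinatorics behind the ~2,000 noun forms cited above.
# Finnish nouns: ~15 cases x 2 numbers x (6 possessive suffixes + bare)
# x ~10 clitic combinations -- the exact split is an assumption, but
# the product lands in the cited order of magnitude.
cases, numbers, possessives, clitics = 15, 2, 7, 10
forms_per_noun = cases * numbers * possessives * clitics
print(forms_per_noun)  # 2100, roughly the 2,000 forms cited

# With a 40,000-root lexicon, exhaustively listing inflected forms
# alone would already need tens of millions of entries -- before
# derivation and compounding, which multiply this open-endedly.
print(40_000 * forms_per_noun)  # 84,000,000 entries
```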

As a natural consequence, Finnish researchers among others have had to
resort to solutions that combine a basic lexicon with a rule system,
which can then be compiled into a morphological analysis program.
Probably the best known such approach is Prof. Kimmo Koskenniemi's
Two-Level Model, which has to one extent or another been applied to
some one hundred languages, including Klingon. Other researchers in
the field include Lauri Karttunen, who has for some time been active
in the States but has Finnish roots. Karttunen's model has been
developed at Xerox, which has a similar platform based on so-called
finite-state automata, i.e. the lexicon and rule set are compiled into
such a data structure. In the case of Finnish, with some 40,000 roots
in the lexicon, an extensive derivational rule set, and a full
morphological rule set, one could recognize in the range of 90% of
some 30 million words' worth of Finnish newspaper text, representing
some 4 million distinct word forms (the rest being non-Finnish words
and proper names).
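
To make the lexicon-plus-rules idea concrete, here is a tiny sketch in
the spirit of, but vastly simpler than, the Two-Level Model. The
stems, suffixes and the single consonant-gradation rule are
illustrative toys of my own, not a real description of Finnish:

```python
# Minimal lexicon-plus-rules morphological analyzer (toy example).
STEMS = {"talo": "house", "katu": "street"}                    # noun roots
SUFFIXES = {"": "NOM.SG", "n": "GEN.SG", "ssa": "INE.SG", "t": "NOM.PL"}

def gradation(stem: str, suffix: str) -> str:
    # Toy two-level-style alternation: t weakens to d before these
    # suffixes, so katu + n surfaces as "kadun".
    if suffix and suffix[0] in "nt" and stem.endswith(("tu", "ty")):
        return stem[:-2] + "d" + stem[-1]
    return stem

def analyze(word: str):
    """Return all (stem gloss, tag) analyses whose surface form matches."""
    results = []
    for stem, gloss in STEMS.items():
        for suffix, tag in SUFFIXES.items():
            if gradation(stem, suffix) + suffix == word:
                results.append((f"{stem} '{gloss}'", tag))
    return results

print(analyze("kadun"))    # [("katu 'street'", 'GEN.SG')]
print(analyze("talossa"))  # [("talo 'house'", 'INE.SG')]
```

The real systems differ in that the lexicon and rules are compiled
into finite-state automata, so analysis runs in time proportional to
the length of the word, not the size of the lexicon.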

If one has experience of creating such lexicons and rule sets, a
working pilot version can well be put together within 6-9 months. If
not, it will obviously take longer. Rudimentary working models have
been developed as parts of master's theses (e.g. for Danish) and
doctoral dissertations, and the technologies mentioned should be
fairly familiar to computer scientists with a linguistic interest and
to computational linguists. Nevertheless, such analyzers have in
recent years also been developed for smaller indigenous languages,
such as North Sámi with some 30,000 speakers, financed by the
Norwegian state; see
<http://www.ling.helsinki.fi/uhlcs/saletek/index.shtml> (the text is
in Norwegian). I have recently also understood that such an analyzer
is in the works for Greenlandic.

Unless things have drastically changed since the time I worked in the
language industry, the tools needed for creating such morphological
analyzers are available for academic research purposes at minimal cost
to universities and the like. Such an academic price does not,
however, include any real support; one has to learn mainly by trial
and error on one's own. Companies that one can contact include
Lingsoft and Connexor in Finland, and XSoft, which licences the Xerox
tools:

	www.lingsoft.fi
	www.connexor.fi

Having worked with both of these companies, I would ask you to bear in
mind that they are commercial enterprises. Unless you get a sufficient
grant to pay for support, development and the like, the companies
really can't afford to do it for free, either. In addition, be
prepared to peruse some pretty lengthy and detailed contracts, which
these companies have for academic licences. Finally, I was never
personally involved in the actual creation of the lexica and rule sets
(I was the project manager), so I can only point to organizations and
people who might be of assistance.

The main value of such analyzers is, of course, that one isn't limited
to a portion of the vocabulary. If the vocabulary to be parsed is
fairly limited, one can perhaps get by with a word list.

But I'd suggest first figuring out what sorts and amounts of text one
needs to have morphologically parsed, then considering whether one
could get some 12 months' financing for such a project, and finally
contacting some of the companies mentioned, or others.

Hoping this is of some interest and assistance, best regards,

	-Antti Arppe
--
======================================================================
Antti Arppe - Master of Science (Engineering)
Researcher & doctoral student (Linguistics)
E-mail: antti.arppe at helsinki.fi
WWW: http://www.ling.helsinki.fi/~aarppe
----------------------------------------------------------------------
Work: Department of General Linguistics, University of Helsinki
Work address: P.O. Box 9 (Siltavuorenpenger 20 A)
   00014 University of Helsinki, Finland
Work telephone: +358 9 19129312 (int'l) 09-19129312 (in Finland)
Work telefax: +358 9 19129307 (int'l) 09-19129307 (in Finland)
----------------------------------------------------------------------
Private address: Fleminginkatu 25 E 91, 00500 Helsinki, Finland
Private telephone: +358 50 5909015 (int'l) 050-5909015 (in Finland)
----------------------------------------------------------------------


On Wed, 26 May 2004, William J Poser wrote:
> I understand that some languages have very large numbers of possible
> fully inflected forms. Indeed, Athabaskan languages must be comparable
> in this respect to Algonquian languages. But I don't see why this
> is an obstacle to using electronic dictionaries as I have suggested.
> Perhaps I am missing something.
>
> Morphological parsers are certainly more difficult to write for
> some languages than others, but I know of no reason that it can't
> be done for even the most complex language. Note that, even if parsing is
> hard to get right or time-consuming, generation is generally easier
> and can be done in advance - you just store all of the forms.
> Computer storage is so large now, and computers so fast, that this
> isn't a problem. Suppose that you have 1000 verbs (plausible, if not
> high, for a language in which so much work is done by the morphology,
> I think) each of which has one million forms, for a total of 10^9 entries.
> For the sake of argument, let's say that the entries average 25 characters,
> that we're using Unicode, and that we're in a range that requires
> 4 bytes per character. The total storage required for the forms
> themselves is then 10^11 bytes. This is an overestimate since, if we're
> dealing with a single language, we could use an encoding requiring
> fewer bytes per character. In fact, for almost all languages that don't
> use Chinese characters, we could get it down to one byte per character.
> In a dictionary of this type, there is relatively little information
> associated with each individual entry - most of it is associated with
> morphemes or in the grammar, and even example sentences will presumably
> not be associated, in general, with fully specified forms. So it seems
> plausible that the other information will require less storage than
> the forms themselves. So the total size of the dictionary, under
> these assumptions, which I think are pretty liberal (that is, overestimate
> the true size), would be 2*10^11 bytes, or 200 gigabytes. Text of
> this type is quite redundant - I think that it is reasonable to expect
> to be able to compress this, in a way compatible with rapid search,
> to no more than a quarter of this. So we're probably talking about
> something like 50GB. That is well within the available storage for a
> personal computer even now. The machine I'm writing on has 120GB.
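
Bill's arithmetic checks out; for the record, here it is spelled out
(all figures are taken from his message):

```python
# Checking the storage estimate in the quoted message.
entries = 1_000 * 1_000_000          # 1000 verbs x one million forms each
chars_per_entry = 25
bytes_per_char = 4                   # the cited 4-byte worst case
form_bytes = entries * chars_per_entry * bytes_per_char
print(form_bytes)                    # 100000000000 = 10^11 bytes

total_bytes = 2 * form_bytes         # doubled for the associated information
print(total_bytes // 10**9)          # 200 (GB, uncompressed)
print(total_bytes // 4 // 10**9)     # 50  (GB, at 4:1 compression)
```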
>
> Since the forms are ordered, efficient
> forms of search can be used. Even plain binary search has a worst
> case of 30 probes on 10^9 items, meaning that the search time will be
> negligible. By way of comparison, I just used the Unix utility grep
> to search for a form I knew to be at the end of a list of slightly
> over a million Swahili verb forms I happen to have in a file, one
> per line. On my home computer, a 1.6 GHz Pentium 4, by no means bleeding
> edge, this search took only two seconds. That means that in two seconds
> it did a million comparisons of the search form against the list,
> each equivalent to one of the 30 probes necessary in a binary search
> of a billion forms in the worst case.
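
To interject briefly: the binary search Bill describes is a few lines
with Python's standard bisect module (the word forms below are made-up
placeholders, not his actual Swahili list):

```python
import bisect
import math

# Worst-case probe count for binary search over a billion sorted forms.
print(math.ceil(math.log2(10**9)))   # 30

# Lookup in a sorted list of pre-generated forms; in practice the list
# would be the full generated-form file, one form per line.
forms = sorted(["anapenda", "anasoma", "ninasoma", "tunasema"])

def lookup(word: str) -> bool:
    i = bisect.bisect_left(forms, word)   # O(log n) probes
    return i < len(forms) and forms[i] == word

print(lookup("ninasoma"))    # True
print(lookup("ninaandika"))  # False
```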
>
> I submit, therefore, that even if writing or running a morphological
> parser is problematic, so long as we can generate all the forms in
> advance, it is well within the realm of possibility to store them
> and search them efficiently.
>
> Bill
>
>
>
>
> --
> Bill Poser, Linguistics, University of Pennsylvania
> http://www.ling.upenn.edu/~wjposer/ billposer at alum.mit.edu


