[Lexicog] polysynthetic languages and dictionaries

William J Poser billposer at ALUM.MIT.EDU
Wed May 26 05:06:11 UTC 2004


Wayne,

I understand that some languages have very large numbers of possible
fully inflected forms. Indeed, Athabaskan languages must be comparable
in this respect to Algonquian languages. But I don't see why this
is an obstacle to using electronic dictionaries as I have suggested.
Perhaps I am missing something.

Morphological parsers are certainly more difficult to write for
some languages than others, but I know of no reason that it can't
be done for even the most complex language. Note that, even if parsing is
hard to get right or time-consuming, generation is generally easier
and can be done in advance - you just store all of the forms.
Computer storage is so large now, and computers so fast, that this
isn't a problem. Suppose that you have 1000 verbs (plausible, if not
high, for a language in which so much work is done by the morphology,
I think) each of which has one million forms, for a total of 10^9 entries.
For the sake of argument, let's say that the entries average 25 characters,
that we're using Unicode, and that we're in a range that requires
4 bytes per character. The total storage required for the forms
themselves is then 10^11 bytes. This is an overestimate since, if we're
dealing with a single language, we could use an encoding requiring
fewer bytes per character. In fact, for almost all languages that don't
use Chinese characters, we could get it down to one byte per character.
In a dictionary of this type, there is relatively little information
associated with each individual entry - most of it is associated with
morphemes or in the grammar, and even example sentences will presumably
not be associated, in general, with fully specified forms. So it seems
plausible that the other information will require less storage than
the forms themselves. So the total size of the dictionary, under
these assumptions, which I think are pretty liberal (that is, overestimate
the true size), would be 2*10^11 bytes, or 200 gigabytes. Text of
this type is quite redundant - I think that it is reasonable to expect
to be able to compress this, in a way compatible with rapid search,
to no more than a quarter of this. So we're probably talking about
something like 50GB. That is well within the available storage for a
personal computer even now. The machine I'm writing on has 120GB.
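
The arithmetic above can be checked with a minimal sketch in Python. The
variable names are mine, chosen for illustration; the figures are the
assumptions stated in the paragraph (1000 verbs, one million forms each,
25 characters per form, 4 bytes per character, other information no
larger than the forms, compression to a quarter).

```python
# Back-of-the-envelope storage estimate using the figures assumed above.
verbs = 1_000
forms_per_verb = 1_000_000
chars_per_form = 25      # assumed average entry length
bytes_per_char = 4       # worst-case fixed-width Unicode encoding

entries = verbs * forms_per_verb                         # 10^9 entries
form_bytes = entries * chars_per_form * bytes_per_char   # 10^11 bytes

# Assume the other per-entry information is no larger than the forms.
total_bytes = 2 * form_bytes                             # 2*10^11 bytes

# Assume compression to a quarter of the original size.
compressed_bytes = total_bytes // 4

GB = 10**9  # decimal gigabyte
print(f"entries:    {entries:,}")
print(f"forms only: {form_bytes // GB} GB")
print(f"total:      {total_bytes // GB} GB")
print(f"compressed: {compressed_bytes // GB} GB")
```

Changing bytes_per_char to 1 shows how much the estimate shrinks with a
single-byte encoding.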

Since the forms are ordered, efficient search methods can be used. Even
plain binary search has a worst case of 30 probes on 10^9 items (since
2^30 exceeds 10^9), meaning that the search time will be
negligible. By way of comparison, I just used the Unix utility grep
to search for a form I knew to be at the end of a list of slightly
over a million Swahili verb forms I happen to have in a file, one
per line. On my home computer, a 1.6 GHz Pentium 4, by no means bleeding
edge, this search took only two seconds. That means that in two seconds
it did a million comparisons of the search form against the list,
each equivalent to one of the 30 probes necessary in a binary search
of a billion forms in the worst case.
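
To make the comparison concrete, here is a minimal sketch of binary
search over a sorted list of forms that also counts its probes. The
function name and the placeholder strings ("form0000000" and so on) are
invented for illustration; they stand in for real verb forms and are not
the Swahili list mentioned above.

```python
def binary_search(sorted_forms, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_forms) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                      # one comparison against the list
        if sorted_forms[mid] == target:
            return mid, probes
        if sorted_forms[mid] < target:
            lo = mid + 1                 # target sorts after the midpoint
        else:
            hi = mid - 1                 # target sorts before the midpoint
    return -1, probes

# Placeholder data: one million made-up forms, already in sorted order.
forms = sorted(f"form{i:07d}" for i in range(1_000_000))

# Look up the last form in the list, the worst case for a linear scan.
index, probes = binary_search(forms, "form0999999")
print(index, probes)

# Worst-case probe count for n sorted items is floor(log2(n)) + 1,
# which for a positive integer n equals n.bit_length().
worst_case_billion = (10**9).bit_length()
print(worst_case_billion)
```

A linear scan of the same list needs up to a million comparisons, whereas
the sketch above needs at most 20 for a million items and at most 30 for
a billion.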

I submit, therefore, that even if writing or running a morphological
parser is problematic, so long as we can generate all the forms in
advance, it is well within the realm of possibility to store them
and search them efficiently.

Bill




--
Bill Poser, Linguistics, University of Pennsylvania
http://www.ling.upenn.edu/~wjposer/ billposer at alum.mit.edu


