[Lexicog] polysynthetic languages and dictionaries

Wayne Leman wayne_leman at SIL.ORG
Wed May 26 13:54:01 UTC 2004


Sorry, Bill, I left out a key part of what I was thinking that makes
electronic dictionaries more difficult for native speakers who
speak a polysynthetic language. It is that one of the major reasons these
speakers use the dictionary is to find the correct spelling of a word. If a
speaker wants the correct spelling of a verb that is 50 letters long, with,
say, 12 morphemes, it is highly likely, at least in our situation, that the
speaker will not know how to spell all the morphemes needed to locate the
word in the dictionary. So a parser, while helpful, might choke if the
speaker cannot spell the entire word: it can bring up the correct spelling
of the whole word only if the speaker has already spelled the morphemes right.

UNLESS, of course, we use some fuzzy logic, or "spelled-something-like"
matching, and/or code similar to what is in some e-dictionaries for
English and other major languages, where the user simply *begins* typing
the desired word and the program starts displaying all possible spellings
as soon as the user has typed in, say, five letters. I have the American
Heritage Dictionary of English on my computer and I often use this
"partial spelling" feature to locate a word whose spelling I am not
sure of.
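Both lookups can be sketched in a few lines of Python. This is only a
minimal illustration, not a real dictionary engine; the wordlist below is
a tiny invented placeholder, and the similarity cutoff is an assumption:

```python
import difflib

# A toy headword index standing in for the dictionary.
# (Hypothetical forms; a real polysynthetic wordlist would be far larger.)
wordlist = [
    "naasestse",
    "nestaevahosevoomatse",
    "netonesevehe",
]

def spelled_something_like(attempt, words, n=5, cutoff=0.6):
    """Return up to n entries whose spelling resembles the attempt."""
    return difflib.get_close_matches(attempt, words, n=n, cutoff=cutoff)

def starts_with(prefix, words):
    """Return every entry beginning with the letters typed so far."""
    return [w for w in words if w.startswith(prefix)]

# A user who misspells a long verb (one 'o' missing) still gets candidates:
print(spelled_something_like("nestaevahosevomatse", wordlist))
# A user who has typed only the first five letters sees all continuations:
print(starts_with("neton", wordlist))
```

The first function is the "spelled-something-like" idea; the second is the
incremental display-as-you-type idea. In a real system the prefix lookup
would run against a sorted index rather than a linear scan.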

So, I think it is possible to meet the e-dictionary needs of typical
native speakers of polysynthetic languages, but I think it is going to take
some fairly powerful parsing and search engines. It would be fun to see some
e-dictionaries like this produced.

Wayne
-----
Wayne Leman
Cheyenne website: http://www.geocities.com/cheyenne_language

> Wayne,
>
> I understand that some languages have very large numbers of possible
> fully inflected forms. Indeed, Athabaskan languages must be comparable
> in this respect to Algonquian languages. But I don't see why this
> is an obstacle to using electronic dictionaries as I have suggested.
> Perhaps I am missing something.
>
> Morphological parsers are certainly more difficult to write for
> some languages than others, but I know of no reason that it can't
> be done for even the most complex language. Note that, even if parsing is
> hard to get right or time-consuming, generation is generally easier
> and can be done in advance - you just store all of the forms.
> Computer storage is so large now, and computers so fast, that this
> isn't a problem. Suppose that you have 1000 verbs (plausible, if not
> high, for a language in which so much work is done by the morphology,
> I think) each of which has one million forms, for a total of 10^9 entries.
> For the sake of argument, let's say that the entries average 25 characters,
> that we're using Unicode, and that we're in a range that requires
> 4 bytes per character. The total storage required for the forms
> themselves is then 10^11 bytes. This is an overestimate since, if we're
> dealing with a single language, we could use an encoding requiring
> fewer bytes per character. In fact, for almost all languages that don't
> use Chinese characters, we could get it down to one byte per character.
> In a dictionary of this type, there is relatively little information
> associated with each individual entry - most of it is associated with
> morphemes or in the grammar, and even example sentences will presumably
> not be associated, in general, with fully specified forms. So it seems
> plausible that the other information will require less storage than
> the forms themselves. So the total size of the dictionary, under
> these assumptions, which I think are pretty liberal (that is, overestimate
> the true size), would be 2 x 10^11 bytes, or 200 gigabytes. Text of
> this type is quite redundant - I think that it is reasonable to expect
> to be able to compress this, in a way compatible with rapid search,
> to no more than a quarter of this. So we're probably talking about
> something like 50GB. That is well within the available storage for a
> personal computer even now. The machine I'm writing on has 120GB.
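Bill's arithmetic can be checked in a few lines. This is just his estimate
restated as code; the verb count, forms per verb, character length, and
compression ratio are all his stated assumptions:

```python
verbs = 1_000
forms_per_verb = 1_000_000
chars_per_form = 25
bytes_per_char = 4                # worst-case 4-byte Unicode encoding

entries = verbs * forms_per_verb              # 10^9 entries
form_bytes = entries * chars_per_form * bytes_per_char
total_bytes = 2 * form_bytes                  # double for non-form information
compressed = total_bytes // 4                 # assume 4:1 compression

print(entries)       # 1000000000   (10^9)
print(form_bytes)    # 100000000000 (10^11 bytes, 100 GB)
print(total_bytes)   # 200000000000 (200 GB)
print(compressed)    # 50000000000  (50 GB)
```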
>
> Since the forms are ordered, efficient
> forms of search can be used. Even plain binary search has a worst
> case of 30 probes on 10^9 items, meaning that the search time will be
> negligible. By way of comparison, I just used the Unix utility grep
> to search for a form I knew to be at the end of a list of slightly
> over a million Swahili verb forms I happen to have in a file, one
> per line. On my home computer, a 1.6 GHz Pentium 4, by no means bleeding
> edge, this search took only two seconds. That means that in two seconds
> it did a million comparisons of the search form against the list,
> each equivalent to one of the 30 probes necessary in a binary search
> of a billion forms in the worst case.
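The ordered-list search Bill describes is exactly what Python's standard
bisect module provides. A minimal sketch, with three invented Swahili-style
forms standing in for the pre-generated list (the technique is the same at
a billion entries):

```python
import bisect

# Sorted list standing in for the full pre-generated form list.
forms = sorted(["anapenda", "ninapenda", "tunapenda"])

def lookup(form, sorted_forms):
    """Binary search: O(log n) probes, ~30 for a billion entries."""
    i = bisect.bisect_left(sorted_forms, form)
    return i < len(sorted_forms) and sorted_forms[i] == form

print(lookup("ninapenda", forms))   # True
print(lookup("ninapika", forms))    # False
```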
>
> I submit, therefore, that even if writing or running a morphological
> parser is problematic, so long as we can generate all the forms in
> advance, it is well within the realm of possibility to store them
> and search them efficiently.
>
> Bill


