[Lexicog] Re: Citation forms in Prefixing Languages

Ron Moe ron_moe at SIL.ORG
Wed Jan 21 04:37:00 UTC 2004


The idea of producing an in-line paper parser is brilliant. It wouldn't work
for most Bantu languages, because there are 5-6 prefix slots with 5-20
prefixes in each, resulting in something like 50,000 possible prefix
combinations. But there are only around 220 combinations in Maguindanaon,
which brings it within range. Many of these prefix combinations can only be
followed by a few segments, and those segments will be the beginning of the
root:

If the word you are looking for starts with "papeds"
as in "papedsageb" look under "sageb" (paped+sageb)

If the word you are looking for starts with "papedt"
as in "papedtalu" look under "talu" (paped+talu)

Other strings are ambiguous and we would need to suggest two or more
options:

If the word you are looking for starts with "nag"
as in "nagayuk" look under "ayuk" (nag+ayuk)
as in "nagilip" look under "gilip" (na+gilip)

If the word you are looking for starts with "pan"
as in "panila" look under "dila" (paN+dila)
as in "pananam" look under "nanam" (paN+nanam)
as in "panageb" look under "sageb" (paN+sageb)
as in "panalu" look under "talu" (paN+talu)
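
The lookup described by these pseudo-entries can be sketched as a small table-driven function. This is only an illustration under stated assumptions: the table holds just the four prefix strings quoted above (a real table would have a couple of hundred rows), and the names RULES and candidates are made up for this sketch. It returns every option the printed entry would suggest, without knowing which of them are real roots.

```python
# Each rule maps a surface prefix string to the analyses a reader should try:
# (characters to strip, letter to restore, label of the underlying prefix).
# Rows are transcribed from the examples in this message only (assumption:
# nothing else is known about the real table).
RULES = {
    "papeds": [(5, "", "paped+")],
    "papedt": [(5, "", "paped+")],
    "nag": [(3, "", "nag+"), (2, "", "na+")],
    "pan": [(3, "d", "paN+"), (3, "n", "paN+"), (3, "s", "paN+"), (3, "t", "paN+")],
}


def candidates(word):
    """Return (citation form, analysis) pairs for the longest matching prefix string."""
    for prefix in sorted(RULES, key=len, reverse=True):
        if word.startswith(prefix):
            return [
                (restore + word[strip:], label + restore + word[strip:])
                for strip, restore, label in RULES[prefix]
            ]
    return [(word, word)]  # no prefix string matched: look the word up as-is


for form in ("papedsageb", "nagilip", "panila"):
    print(form, "->", candidates(form))
```

For "panila" the sketch offers dila, nila, sila and tila; the reader still has to check which of these is actually in the dictionary.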

We would need to test different formats to determine what made sense to the
user. We could determine how frequent each possibility is and put the most
frequent first. These pseudo-entries could be put in boxes to highlight
them.
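
Putting the most frequent possibility first amounts to a simple sort. A minimal sketch, assuming we had counts of how often each hidden letter occurs after a given prefix; the numbers below are invented purely for illustration and are not Maguindanaon data, and order_options is a hypothetical name.

```python
def order_options(hidden_letters, observed):
    """Order the options of one pseudo-entry, most frequent hidden letter first."""
    # Letters that were never observed count as zero and end up last.
    return sorted(hidden_letters, key=lambda letter: -observed.get(letter, 0))


# Hypothetical counts of roots after paN-, by the initial letter it replaces.
observed_after_paN = {"d": 14, "n": 3, "s": 22, "t": 9}

print(order_options(["d", "n", "s", "t"], observed_after_paN))
```

With these made-up counts, the "pan" entry would list the s-initial option first and the n-initial option last.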

Ron Moe
SIL, Uganda

-----Original Message-----
From: Mike Maxwell [mailto:maxwell at ldc.upenn.edu]
Sent: Tuesday, January 20, 2004 7:24 PM
To: lexicographylist at yahoogroups.com
Subject: Re: [Lexicog] Re: Citation forms in Prefixing Languages


Koontz John E wrote:
> This calls to mind trying to find entries in a Classical Greek lexicon
> starting with an inflected irregular non-present stem form from text,
> though at least student versions sometimes list the first person of
> common irregular stems.

I think the crucial issue is how many difficult-to-parse forms there are,
where difficulty can be caused by irregular forms (as here), or by opacity
(as in many Philippine languages), or by sheer number of prefixes (as in
Bantu), or by some combination of these (as in Athabaskan languages).  If
there aren't too many difficult forms, you can list them among the other
entries (i.e. as minor entries).  But if they're overwhelming (as I would
imagine the Bantu ones are--after all, these are agglutinating languages),
then listing is probably not an option, because the vast majority of the
entries would be these minor entries.

It would be interesting to see how bad it would be to list all or most of
the prefixed forms in these other languages.  My impression (from very
limited experience) is that Cebuano (Philippine) would not be too bad,
Tagalog would be somewhat worse, and Athabaskan languages would be nearly as
bad as the Bantu languages (in terms of the number of prefixed forms that
would be listed--my bias is that Athabaskan is much worse in terms of
opacity and complexity of derivation).

Ron Moe wrote:
> So if you encounter 'minadag' you would have to look under
> 'adag' 'padag' 'badag' and 'madag', hoping that one of them
> was the word you were looking for. If you wanted to find 'manadag',
> you might find it under 'adag' 'tadag' 'dadag' 'nadag' or 'sadag'.
> And that's assuming you were familiar enough with the language
> to even know where to look, and analytical enough to recognize
> potential prefixes and figure out what they might be hiding.

OK, so suppose we assume the dictionary user doesn't know the language well
enough to do this (or is too unsophisticated to think of the problem in this
way).  Can we help him out by listing in the printed dictionary all the
possible "prefix strings"--where by "prefix string" I mean the string of
characters up to the first invariable letter or so of the stem?  This would
then be cross referenced to the various possible citation forms.  So for
this example, we would have pseudo-lex entries like

    mina... See a..., pa..., ba..., ma....
    mine... See e..., pe..., be..., me....
and
    mana...  See a..., ta..., da..., na..., sa....
etc.
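
Pseudo-entries of this shape could be generated mechanically rather than typed by hand. The sketch below assumes only what Ron Moe's 'minadag' and 'manadag' examples say about which consonants each prefix string can hide; the names HIDDEN and pseudo_entries are hypothetical, and the vowel list passed in is the caller's choice.

```python
# Consonants each opaque prefix string may be hiding, taken from the quoted
# 'minadag' and 'manadag' examples; "" means the root starts with the vowel.
HIDDEN = {
    "min": ["", "p", "b", "m"],
    "man": ["", "t", "d", "n", "s"],
}


def pseudo_entries(vowels):
    """Build one cross-reference line per (prefix string, following vowel)."""
    lines = []
    for prefix, onsets in HIDDEN.items():
        for vowel in vowels:
            targets = ", ".join(f"{onset}{vowel}..." for onset in onsets)
            lines.append(f"{prefix}{vowel}... See {targets}.")
    return lines


for line in pseudo_entries("ae"):
    print(line)
```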

So the number of these pseudo-entries would be on the order of the number of
prefix sequences * the number of letters that can follow the opaque
prefixes, rather than the number of prefix sequences * the number of roots
of the appropriate morphosyntactic class.  A much smaller number, I would
think, and perhaps manageable in some cases.
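
To make the comparison concrete, here is a back-of-the-envelope sketch. The figure of 220 prefix sequences is the Maguindanaon number quoted elsewhere in this thread; the 20 following letters and 3000 roots are placeholder assumptions, not measured values, and both products are upper bounds since not every combination occurs.

```python
def pseudo_entry_count(prefix_sequences, following_letters):
    """Upper bound on pseudo-entries: one per prefix sequence and following letter."""
    return prefix_sequences * following_letters


def full_listing_count(prefix_sequences, roots):
    """Upper bound on minor entries if every prefixed form were listed."""
    return prefix_sequences * roots


PREFIX_SEQUENCES = 220    # figure quoted in this thread for Maguindanaon
FOLLOWING_LETTERS = 20    # assumption, for illustration only
ROOTS = 3000              # assumption, for illustration only

small = pseudo_entry_count(PREFIX_SEQUENCES, FOLLOWING_LETTERS)
large = full_listing_count(PREFIX_SEQUENCES, ROOTS)
print(small, large, large // small)
```

Under these assumptions the pseudo-entry approach needs at most 4,400 lines against 660,000 for a full listing; the ratio is simply the number of roots divided by the number of following letters.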

The use of pseudo-lex entries would take some training, but it might be
better than trying to teach the opaque (in the linguistic sense, although
it's likely to seem opaque in the other sense!) phonological and
morphological processes.

Of course the real answer is a computer program that parses a wordform and
gives you a pointer to the root(s), and a pocket computer that this will run
on.  Some day.  (Actually, we're working on this, but it is apt to be
impractical in many cases.  And you have to build the parser for each
language.)
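
As a rough idea of what such a program adds over the paper table, the sketch below checks each candidate against a root list, so only real roots are returned. Everything here is an assumption for illustration: the lexicon contains only roots cited in this thread, the prefix list covers only paN-, nag- and na-, and it is not a description of the parser actually being built.

```python
# Hypothetical mini-lexicon: only roots cited in this thread.
LEXICON = {"sageb", "talu", "ayuk", "gilip", "dila", "nanam"}

# (surface prefix, letters it may have replaced); "" means nothing was replaced.
PREFIXES = [
    ("pan", ["d", "n", "s", "t"]),
    ("nag", [""]),
    ("na", [""]),
]


def parse(word, lexicon=LEXICON):
    """Return the known roots that could underlie a prefixed wordform."""
    roots = []
    for surface, restored in PREFIXES:
        if not word.startswith(surface):
            continue
        rest = word[len(surface):]
        for letter in restored:
            root = letter + rest
            if root in lexicon and root not in roots:
                roots.append(root)
    return roots


for form in ("panalu", "nagilip", "nagayuk"):
    print(form, "->", parse(form))
```

Unlike the printed pseudo-entry, which has to list every possibility, the lexicon check narrows "panalu" to talu alone.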

    Mike Maxwell
    Linguistic Data Consortium
    maxwell at ldc.upenn.edu





Yahoo! Groups Links

To visit your group on the web, go to:
 http://groups.yahoo.com/group/lexicographylist/

To unsubscribe from this group, send an email to:
 lexicographylist-unsubscribe at yahoogroups.com

Your use of Yahoo! Groups is subject to:
 http://docs.yahoo.com/info/terms/








