[Lexicog] Collaborative lexicography software?
Allan Johnson
allan_johnson at SIL.ORG
Mon May 5 14:22:19 UTC 2008
maxwell at ldc.upenn.edu wrote:
> Quoting Heather Souter <hsouter at gmail.com>:
>
>> I, too, am very interested in learning about dictionary development
>> for languages with complex morphologies. ...
>> Any insight into how to create dictionaries that are useful to
>> speakers and learners and not only language specialists would be
>> especially welcomed!
>>
>
> One "solution" (quote marks explained at the end of this msg) is to
> give people a computer program that allows them to look up words
> regardless of the inflected form that they type in. For the simple
> cases, this can often be done by just looking for a substring of the
> typed-in word. For a purely suffixing language, the substring would
> begin at the first letter of the typed-in word.
>
> Of course, the simple cases are not the ones where people need the most
> help. The complex cases--where there is prefixing (or worse, both
> prefixing and suffixing), or infixing, or reduplication, or lots of
> stem allomorphy--are the ones where people need help, and where the
> simple solutions don't work. For these morphologically complex
> languages, there needs to be a morphological parser between the user
> and the electronic dictionary per se.
For a dictionary user to be able to look up any wordform in a
computer-based (maybe online) dictionary, another approach would be to
explicitly list all forms in the dictionary. Since such a dictionary
would take a lot of paper to print, we're in the habit of avoiding such
an approach. But as I've explored the capabilities of the FLEx program,
it strikes me that there seems to be an appropriate place to explicitly
list any wordform that we might desire to include as a lookup form. A
derived form can be given its own place as the headword of an entry, and
linked as a "complex form" to the root or stem from which it's derived.
An inflected form can be given its own place as the headword of a minor
entry and linked as an "inflectional variant" to the uninflected form of
the stem, or to the inflected form that users will most likely try to
look up.
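
A minimal sketch of this idea in Python follows. The Entry class, the
field names, and the English example words are illustrative assumptions
only; they are not FLEx's actual data model or API. The sketch just shows
that once every lookup form is an explicit headword with a link, lookup
becomes an exact match followed by link-following.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """One dictionary entry (hypothetical structure, not FLEx's)."""
    headword: str
    gloss: str = ""
    # (relation, target headword), e.g. ("inflectional variant", "walk")
    links: List[Tuple[str, str]] = field(default_factory=list)


def build_index(entries: List[Entry]) -> Dict[str, Entry]:
    """Map every explicitly listed headword to its entry."""
    return {e.headword: e for e in entries}


def look_up(index: Dict[str, Entry], wordform: str) -> List[Entry]:
    """Return the chain of entries from the typed form to the main entry."""
    chain = []
    seen = set()  # guards against circular links
    current = index.get(wordform)
    while current is not None and current.headword not in seen:
        chain.append(current)
        seen.add(current.headword)
        # Follow the first link, if any, toward the root or stem entry.
        current = index.get(current.links[0][1]) if current.links else None
    return chain


# Illustrative data only.
entries = [
    Entry("walk", gloss="to move on foot"),
    Entry("walked", links=[("inflectional variant", "walk")]),
    Entry("walker", gloss="one who walks", links=[("complex form", "walk")]),
]
index = build_index(entries)
```

Calling look_up(index, "walked") returns the minor entry followed by the
main entry it links to; a form that was never listed returns an empty list.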
Automated parsing could still have a role in such a dictionary, but the
role would be to assist in building the dictionary rather than to assist
in reading it. When analyzing words that it encounters in vernacular
texts, the parser would draw its conclusion regarding what roots and
affixes make up the word, and thus what entries it should be linked to.
Drawing on knowledge of the actual meanings of the words, the human
dictionary compiler would then evaluate whether to accept the parser's
choice or make links that the parser didn't predict. If a mismatch involves
some regularity of the language that the parser just doesn't yet handle, the
dictionary compiler could use this parser failure as feedback to help
improve the parser's success in future predictions. If it involves an
irregularity of the language which can't reasonably be captured by the
parser, then it can just be left as residue as far as the parser is
concerned. A dictionary user will still be able to find the word, since
it has been explicitly listed and linked.
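
The review step described above could be sketched roughly as follows. The
function name, the outcome labels, and the example analyses are all
hypothetical, chosen only to illustrate the three outcomes: accept the
parser's proposal, correct it and feed the case back to the parser, or
correct it and leave it as residue.

```python
from typing import List, Tuple


def review(proposed: List[str], human_choice: List[str],
           regular: bool) -> Tuple[List[str], str]:
    """Compare the parser's proposed links with the compiler's choice.

    Returns the links to store in the dictionary and a follow-up label.
    The human choice is always what gets stored.
    """
    if human_choice == proposed:
        return human_choice, "accepted"
    if regular:
        # A pattern the parser should handle: use it as feedback.
        return human_choice, "improve parser"
    # An irregularity the parser can't reasonably capture.
    return human_choice, "residue"


# Illustrative cases only.
ok = review(["walk", "-ed"], ["walk", "-ed"], regular=True)
missed_rule = review(["hopp", "-ed"], ["hop", "-ed"], regular=True)
irregular = review(["wen", "-t"], ["go"], regular=False)
```

In every case the explicitly stored links come from the human decision, so
the word remains findable whether or not the parser ever learns the pattern.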
This approach wouldn't help users find words that haven't
yet been encountered in texts. So once the parser has "learned" the
language well enough to give fairly reliable results, it might be
profitable to combine Mike's approach with this one - using the parser
for lookup of any words that don't yet have exact matches in the
dictionary. And whenever this happens, the newly looked-up words could
be submitted for human review so that they can be explicitly listed for
future lookups.
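
A rough sketch of the combined lookup, under stated assumptions: the index
is a plain mapping from listed wordforms to main-entry headwords, and
toy_parse is only a stand-in that strips a final "s". A real morphological
parser would take its place. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def toy_parse(wordform: str) -> Optional[str]:
    """Stand-in for a real parser: strip a final 's' to guess the stem."""
    if len(wordform) > 1 and wordform.endswith("s"):
        return wordform[:-1]
    return None


def look_up(wordform: str,
            index: Dict[str, str],
            parse: Callable[[str], Optional[str]],
            review_queue: List[Tuple[str, str]]) -> Optional[str]:
    """Try an exact match first; fall back to the parser if none exists."""
    if wordform in index:
        return index[wordform]
    stem = parse(wordform)
    if stem is not None and stem in index:
        # Queue the parser's guess so a human can list it explicitly later.
        review_queue.append((wordform, stem))
        return index[stem]
    return None


# Illustrative data only.
index = {"walk": "walk", "walked": "walk"}
review_queue: List[Tuple[str, str]] = []
found_listed = look_up("walked", index, toy_parse, review_queue)
found_parsed = look_up("walks", index, toy_parse, review_queue)
not_found = look_up("xyz", index, toy_parse, review_queue)
```

Only the parser-assisted hit ends up in review_queue; exact matches and
complete misses leave it unchanged.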
Allan J.