[Lexicog] Collaborative lexicography software?

Mon May 5 14:22:19 UTC 2008

maxwell at ldc.upenn.edu wrote:
> Quoting Heather Souter <hsouter at gmail.com>:
>   
>> I, too, am very interested in learning about dictionary development
>> for languages with complex morphologies.  ...
>> Any insight into how to create dictionaries that are useful to
>> speakers and learners and not only language specialists would be
>> especially welcomed!
>>     
>
> One "solution" (quote marks explained at the end of this msg) is to 
> give people a computer program that allows them to look up words 
> regardless of the inflected form that they type in.  For the simple 
> cases, this can often be done by just looking for a substring of the 
> typed-in word.  For a purely suffixing language, the substring would 
> begin at the first letter of the typed-in word.
>
> Of course, the simple cases are not the ones where people need the most 
> help.  The complex cases--where there is prefixing (or worse, both 
> prefixing and suffixing), or infixing, or reduplication, or lots of 
> stem allomorphy--are the ones where people need help, and where the 
> simple solutions don't work.  For these morphologically complex 
> languages, there needs to be a morphological parser between the user 
> and the electronic dictionary per se.

For a dictionary user to be able to look up any wordform in a 
computer-based (maybe online) dictionary, another approach would be to 
explicitly list all forms in the dictionary. Since such a dictionary 
would take a lot of paper to print, we're in the habit of avoiding such 
an approach. But as I've explored the capabilities of the FLEx program, 
it strikes me that there seems to be an appropriate place to explicitly 
list any wordform that we might desire to include as a lookup form. A 
derived form can be given its own place as the headword of an entry, and 
linked as a "complex form" to the root or stem from which it's derived. 
An inflected form can be given its own place as the headword of a minor 
entry and linked as an "inflectional variant" to the uninflected form of 
the stem, or to the inflected form that users will most likely try to 
look up.
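
Purely as an illustration, here is a minimal Python sketch of that kind 
of explicit listing. It is not FLEx's actual data model; the names 
Entry, link_type, link_target and lookup_chain are hypothetical, and the 
English words are just stand-ins for a vernacular.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    headword: str
    gloss: str = ""
    link_type: Optional[str] = None    # e.g. "complex form", "inflectional variant"
    link_target: Optional[str] = None  # headword of the entry this one points to

def lookup_chain(index, wordform):
    """Return the listed entry for wordform, followed by the entries it links to."""
    chain, seen = [], set()
    entry = index.get(wordform)
    while entry is not None and entry.headword not in seen:  # guard against link cycles
        chain.append(entry)
        seen.add(entry.headword)
        entry = index.get(entry.link_target) if entry.link_target else None
    return chain

# Hypothetical sample data: one main entry, one minor entry, one derived entry.
entries = [
    Entry("walk", "to move on foot"),
    Entry("walked", "", "inflectional variant", "walk"),
    Entry("walker", "one who walks", "complex form", "walk"),
]
index = {e.headword: e for e in entries}
```

With this layout, looking up "walked" finds the minor entry first and 
then follows its link to "walk", with no parsing needed at lookup time.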

Automated parsing could still have a role in such a dictionary, but the 
role would be to assist in building the dictionary rather than to assist 
in reading it. When analyzing words that it encounters in vernacular 
texts, the parser would draw its conclusion regarding what roots and 
affixes make up the word, and thus what entries it should be linked to. 
Based on his knowledge of the actual meanings of the words, the human 
dictionary compiler would then evaluate whether to accept the parser's 
choice or make links that the parser didn't predict. If a wrong or 
missing prediction involves some regularity of the language that the 
parser just doesn't yet handle, the dictionary compiler could use this 
parser failure as feedback to help improve the parser's success in 
future predictions. If it involves an 
irregularity of the language which can't reasonably be captured by the 
parser, then it can just be left as residue as far as the parser is 
concerned. A dictionary user will still be able to find the word, since 
it has been explicitly listed and linked.
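
A rough sketch of that propose-and-review loop might look like the 
following. The "parser" here is only a toy suffix stripper standing in 
for a real morphological parser, and propose_analysis, review and the 
parser_missed flag are hypothetical names made up for the example.

```python
def propose_analysis(wordform, roots, suffixes):
    """Toy stand-in for a parser: a known root plus at most one known suffix."""
    for root in roots:
        rest = wordform[len(root):]
        if wordform.startswith(root) and (rest == "" or rest in suffixes):
            return (root, rest)
    return None  # the parser has no analysis for this wordform

def review(wordform, proposal, human_choice=None):
    """Decide which link to store; flag cases the parser got wrong or missed."""
    if human_choice is not None and (proposal is None or proposal[0] != human_choice):
        # The compiler overrides the parser: feedback for the parser, or residue.
        return {"form": wordform, "root": human_choice, "parser_missed": True}
    if proposal is not None:
        # The compiler accepts the parser's choice.
        return {"form": wordform, "root": proposal[0], "parser_missed": False}
    return {"form": wordform, "root": None, "parser_missed": True}

roots = ["walk", "go"]            # hypothetical root inventory
suffixes = {"ed", "s", "ing"}     # hypothetical suffix inventory

regular = review("walked", propose_analysis("walked", roots, suffixes))
irregular = review("went", propose_analysis("went", roots, suffixes), human_choice="go")
```

Either way the link ends up stored in the dictionary; the flag only 
records whether the parser needs attention or the form is residue.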

This approach wouldn't help users find words that haven't 
yet been encountered in texts. So once the parser has "learned" the 
language well enough to give fairly reliable results, it might be 
profitable to combine Mike's approach with this one - using the parser 
for lookup of any words that don't yet have exact matches in the 
dictionary. And whenever this happens, the newly looked-up words could 
be submitted for human review so that they can be explicitly listed for 
future lookups.
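
The combined lookup could be sketched roughly like this, again with a 
toy suffix stripper in place of a real parser. The names toy_parse, 
lookup_with_fallback and review_queue are assumptions for illustration 
only.

```python
def toy_parse(wordform, suffixes=("ing", "ed", "s")):
    """Toy stand-in for a parser: strip one known suffix and return the remainder."""
    for suffix in suffixes:
        if wordform.endswith(suffix) and len(wordform) > len(suffix):
            return wordform[:-len(suffix)]
    return None

def lookup_with_fallback(wordform, listed, parse, review_queue):
    """Try an exact match first; fall back to the parser and queue the hit for review."""
    if wordform in listed:
        return listed[wordform]
    root = parse(wordform)
    if root is not None and root in listed:
        # Not yet listed: remember it so a human can list it explicitly later.
        review_queue.append((wordform, root))
        return listed[root]
    return None

listed = {"walk": "to move on foot", "walked": "past of walk"}  # hypothetical entries
review_queue = []
exact = lookup_with_fallback("walked", listed, toy_parse, review_queue)
parsed = lookup_with_fallback("walking", listed, toy_parse, review_queue)
```

Here "walked" is found directly, while "walking" is found through the 
parser and lands in the review queue for later listing.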

Allan J.
