[Lexicog] Issues regarding a free dictionary

Mike Maxwell maxwell at LDC.UPENN.EDU
Fri Dec 9 14:33:38 UTC 2005


Andrew Dunbar wrote:
> These problems have been solved before with varying degrees of success.

Agreed, and as you say, with varying degrees of success.  The point of my 
message was that there is often (maybe usually) much more to creating a 
spell checker than just having a list of "dictionary words" in the language.

> For Thai as well as Chinese and Japanese, there are various algorithms or
> systems available to find breaks between words.

I have it on good authority that the Thai word breakers have a very high 
error rate, perhaps 30%, despite some claims to the contrary.  I don't know 
about Japanese (which I imagine is much better) or Chinese.  Spell checking 
in Chinese would, I would have thought, be a moot issue, but maybe 
someone who knows more can comment.
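
As a rough illustration of why dictionary-based word breaking goes wrong, here is a minimal sketch in Python of greedy longest matching, one common dictionary-based technique. It is not the algorithm of any particular Thai word breaker. English with the spaces removed stands in for an unsegmented script, and the lexicon and function names are made up for the example.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = {"the", "them", "me", "men", "no", "now", "here", "where", "nowhere"}


def longest_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon word starts here: emit one character as unknown.
            words.append(text[i])
            i += 1
    return words


def all_segmentations(text, lexicon):
    """Every way of splitting text into lexicon words (exponential; toy use only)."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        if text[:j] in lexicon:
            for rest in all_segmentations(text[j:], lexicon):
                results.append([text[:j]] + rest)
    return results


# Greedy matching grabs "them" and is then stuck with "en".
print(longest_match("themen", LEXICON))
# The only full segmentation is the one a reader would choose: the + men.
print(all_segmentations("themen", LEXICON))
# Some strings are genuinely ambiguous without context.
print(all_segmentations("nowhere", LEXICON))
```

The first call shows the greedy method committing to the wrong word, and the last shows a string with three valid readings. Choosing among them needs context that a plain word list does not supply.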

> For inflection, compounding, and other morphology and such issues which
> give rise to many correct forms, either a "smart" spellchecker which knows
> about paradigms and irregular forms, as well as containing a dictionary is
> one approach. Another approach is to put the "smarts" into a program which
> builds a full dictionary including all inflections etc from a basic dictionary.

Yes, those were the two approaches I alluded to in my msg:

(1) You could use a language like Perl to turn a list of stems, citation 
forms, or some such into a list of all the fully inflected forms of the 
language, taking into account the various paradigms (if the language has 
multiple paradigms), irregular forms (like 'went' in place of *'goed'), 
stem allomorphy or spelling changes under affixation (like 'tries' in place 
of *'trys'), and then use the list directly in a traditional spell checker, 
like ispell.  (Or if the morphology is _really_ simple, and the word list 
is not too long, you can do this by hand.)  Spell correctors can work off 
such a list, too.

(2) Alternatively, you can parse the inflected forms on the fly, using e.g. 
a finite state transducer.  Spell correction might be more difficult with 
this kind of approach (although I'm sure it can be done), but it is perhaps 
the only feasible route for highly inflected languages like Finnish.
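
To make the two approaches concrete, here is a minimal sketch in Python (rather than Perl) for a tiny fragment of English verb morphology. Everything in it is an assumption made for illustration: the stem list, the irregular table, the spelling rules and the function names. The on-the-fly checker is a crude suffix-stripping stand-in and not a finite state transducer. It is only meant to show the contrast between storing every form and analysing each word when it is checked.

```python
# Hypothetical toy data; not taken from ispell or any real lexicon.
STEMS = ["try", "walk", "go", "play"]
IRREGULAR = {"go": {"3sg": "goes", "past": "went"}}
VOWELS = "aeiou"


def _consonant_y(stem):
    return len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS


def third_singular(stem):
    if _consonant_y(stem):
        return stem[:-1] + "ies"  # 'tries', not *'trys'
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"


def past(stem):
    if _consonant_y(stem):
        return stem[:-1] + "ied"
    return stem + ("d" if stem.endswith("e") else "ed")


def inflect(stem):
    """Regular forms first, then irregular overrides ('went', not *'goed')."""
    forms = {"base": stem, "3sg": third_singular(stem), "past": past(stem)}
    forms.update(IRREGULAR.get(stem, {}))
    return forms


def expand(stems):
    """Approach (1): generate the full word list ahead of time."""
    words = set()
    for stem in stems:
        words.update(inflect(stem).values())
    return words


def accepts(word, stems):
    """Approach (2), crudely: guess candidate stems, then confirm by regenerating."""
    stem_set = set(stems)
    candidates = {word}
    for suffix, repl in [("ies", "y"), ("ied", "y"), ("es", ""),
                         ("ed", ""), ("s", ""), ("d", "")]:
        if word.endswith(suffix):
            candidates.add(word[: -len(suffix)] + repl)
    # Irregular forms cannot be reached by stripping suffixes; look them up.
    candidates.update(s for s, forms in IRREGULAR.items() if word in forms.values())
    return any(c in stem_set and word in inflect(c).values() for c in candidates)


word_list = expand(STEMS)
print(sorted(word_list))
print(accepts("tries", STEMS), accepts("trys", STEMS), accepts("goed", STEMS))
```

With this handful of rules the two routes accept the same words. The list in (1) grows with every stem and every slot in the paradigm, while (2) stores only stems and rules, which is the reason for preferring it when a language has a great many forms per stem.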

> I have seen a very good spellchecker for Irish which used the latter method.

If the Irish spellchecker is based on the Irish morphology work that I'm 
familiar with, it uses (or at least is based on) the Xerox finite state 
toolkit.  This is work by Elaine Uí Dhonnchadha.
-- 
	Mike Maxwell
	maxwell at ldc.upenn.edu

