[Lexicog] Issues regarding a free dictionary

Andrew Dunbar hippytrail at GMAIL.COM
Sat Dec 10 18:06:36 UTC 2005


On 12/9/05, Mike Maxwell <maxwell at ldc.upenn.edu> wrote:
> Andrew Dunbar wrote:
> > These problems have been solved before with varying degrees of success.
>
> Agreed, and as you say, with varying degrees of success.  The point of my
> message was that there is often (maybe usually) much more to creating a
> spell checker than just having a list of "dictionary words" in the language.
>
> > For Thai as well as Chinese and Japanese, there are various algorithms or
> > systems available to find breaks between words.
>
> I have it on good authority that the Thai word breakers have a very high
> error rate, perhaps 30%, despite some claims to the contrary.  I don't know
> about Japanese (which I imagine is much better) or Chinese.  Spell checking
> in Chinese would, I would have thought, be a moot issue, but maybe
> someone who knows more can comment.

Yes, I think it's pretty well known by now that the ease of word breaking in Thai
was greatly exaggerated. Online, I have come across work under way on much
better systems, based on much bigger dictionaries, I believe.
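
To make the dictionary-based idea concrete, here is a minimal sketch of
greedy longest-match word breaking. It is not the method of any particular
Thai system; the function name and the toy English lexicon are assumptions
made up for illustration only.

```python
def segment(text, lexicon):
    """Greedy longest-match word breaking over a non-empty set of known words."""
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until the lexicon matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit one character and move on
        words.append(text[i:j])
        i = j
    return words


# Toy lexicon (hypothetical), with spaces removed from "the man".
LEXICON = {"the", "them", "man", "an"}
print(segment("theman", LEXICON))
```

On this input the greedy strategy picks "them" and then "an" rather than
"the" and "man", which hints at why a purely dictionary-driven breaker can
go wrong and why a bigger dictionary alone does not settle the matter.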

Sadly the situation is much, much harder for Japanese and Chinese, at least as
far as morphological analysis goes, which is required to ascertain which groups
of characters belong to the same "word". I do know that native Japanese speakers
often make mistakes choosing the right character. In fact it may be
possible to make a decent-ish Japanese spellchecker just by analyzing
common errors and completely ignoring word breaking. The problems for Chinese
are surely similar, but I suspect simpler, since it uses only one script as
opposed to Japanese's three. Japanese also often has several ways to spell the
same word: often both a kanji spelling and a kana spelling are fine, and often
there are various ways of spelling the inflectional endings added to kanji.
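
As a rough illustration of the "common errors" idea, the sketch below scans
raw text for known misspellings without any word breaking. The table and the
function name are hypothetical; the two entries are just well-known slips
chosen as examples, not data from any real checker.

```python
# Hypothetical table of common misspellings and their usual corrections.
COMMON_ERRORS = {
    "こんにちわ": "こんにちは",
    "うる覚え": "うろ覚え",
}


def find_errors(text, table=COMMON_ERRORS):
    """Return (offset, wrong, suggestion) for every known error found in text."""
    hits = []
    for wrong, right in table.items():
        start = text.find(wrong)
        while start != -1:
            hits.append((start, wrong, right))
            start = text.find(wrong, start + 1)
    return sorted(hits)


for offset, wrong, right in find_errors("こんにちわ、うる覚えですが"):
    print(offset, wrong, "->", right)
```

Plain substring matching like this cannot judge errors that depend on
context, such as a homophone that is a valid word elsewhere, so it is only
a starting point.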

> > For inflection, compounding, and other morphological issues which
> > give rise to many correct forms, one approach is a "smart" spellchecker
> > which knows about paradigms and irregular forms as well as containing a
> > dictionary. Another approach is to put the "smarts" into a program which
> > builds a full dictionary, including all inflections etc., from a basic one.
>
> Yes, those were the two approaches I alluded to in my msg:
>
> (1) You could use a language like perl to turn a list of stems, citation
> forms, or some such into a list of all the fully inflected forms of the
> language, taking into account the various paradigms (if the language has
> multiple paradigms), irregular forms (like 'went' in place of *'goed'),
> stem allomorphy or spelling changes under affixation (like 'tries' in place
> of *'trys'), and then use the list directly in a traditional spell checker,
> like ispell.  (Or if the morphology is _really_ simple, and the word list
> is not too long, you can do this by hand.)  Spell correctors can work off
> such a list, too.
>
> (2) Alternatively, you can parse the inflected forms on the fly, using e.g.
> a finite state transducer.  Spell correction might be more difficult with
> this kind of approach (although I'm sure it can be done), but it is perhaps
> the only feasible route for highly inflected languages like Finnish.
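
Approach (1) can be sketched in a few lines. This is a toy for three English
verb stems, written in Python rather than perl; the rule set, the names and
the stem list are all assumptions for illustration, and a real language
would need far more paradigms. Approach (2) needs a finite state toolkit or
similar and is not shown.

```python
# Irregular past forms listed by hand, e.g. 'went' in place of *'goed'.
IRREGULAR_PAST = {"go": "went"}


def _y_to_i(stem):
    """True when a final consonant + y should change to i before a suffix."""
    return stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou"


def third_singular(stem):
    """Spelling changes under affixation: 'try' -> 'tries', not *'trys'."""
    if _y_to_i(stem):
        return stem[:-1] + "ies"
    if stem.endswith(("s", "x", "z", "o", "ch", "sh")):
        return stem + "es"
    return stem + "s"


def past(stem):
    """Irregular forms take priority over the regular -ed rule."""
    if stem in IRREGULAR_PAST:
        return IRREGULAR_PAST[stem]
    if _y_to_i(stem):
        return stem[:-1] + "ied"
    if stem.endswith("e"):
        return stem + "d"
    return stem + "ed"


def expand(stems):
    """Build the full list of inflected forms from a list of stems."""
    forms = set()
    for stem in stems:
        forms.update({stem, third_singular(stem), past(stem)})
    return forms


WORDS = expand(["try", "walk", "go"])
for candidate in ["tries", "trys", "went", "goed"]:
    print(candidate, "ok" if candidate in WORDS else "misspelled")
```

The resulting set is exactly the kind of flat word list a traditional
checker can consume, at the cost of growing very large for highly inflected
languages.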
>
> > I have seen a very good spellchecker for Irish which used the latter method.
>
> If the Irish spellchecker is based on the Irish morphology work that I'm
> familiar with, it uses (or at least is based on) the Xerox finite state
> toolkit.  This is work by Elaine Uí Dhonnchadha.

I was very impressed with the Irish work. I don't know the language, but I
have read "teach yourself" books a few times, so I have an idea of how tricky
it is, if not of how it works in detail. I assume there is only one person or
team actively working in the field of Irish spellchecking, but again I could
well be wrong.

I hope some find my information useful, errors notwithstanding.

Andrew Dunbar.

> --
>        Mike Maxwell
>        maxwell at ldc.upenn.edu


--
http://linguaphile.sf.net

