[Lexicog] Issues regarding a free dictionary

Mike Maxwell maxwell at LDC.UPENN.EDU
Thu Dec 8 19:16:48 UTC 2005


On Dec 8 2005, Sabine Cretella wrote:
> Btw.: we are also foreseeing the possibility to create spellcheckers :-)

One thing that is easy to forget is the fact that spellcheckers based on 
dictionaries only work if the language (1) writes words with some kind of 
delineating mark (such as a space character on either side), and (2) has 
trivial inflectional morphology. Requirement (1) makes it difficult, if not 
impossible, to envision a Thai spell checker (there is no space between 
words in written Thai), and requirement (2) makes it difficult to imagine a 
dictionary-based spell checker for languages like Tagalog or even Spanish.

There are of course languages where (1) and (2) are not problems. But I 
would guess that most languages have some problems in area (2).

In order to create spell checkers for languages with inflectional 
morphology, you have to have a way of creating the entire paradigm of every 
word. This is rarely trivial. Even for English, you have to cope with 
spelling variations that arise in affixed words ('tries', not *'trys'). For 
Spanish, you have to deal with much larger verbal paradigms, plus stem 
allomorphy ('tengo' and 'tienes', not *'teno' and *'tenes'). For languages 
like Tagalog, with infixing and reduplication, things are even worse.
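
To make the English case concrete, here is a minimal sketch in Python of
generating the -s form of a regular stem so that the word list contains
'tries' rather than *'trys'. The function names and the tiny stem list are
made up for illustration, and the rules cover only regular forms; irregular
words would still need to be listed by hand.

```python
VOWELS = "aeiou"


def s_form(stem):
    """Return the -s form (plural or 3rd person singular) of a regular stem."""
    # Consonant + y: replace y with ie before -s ('try' -> 'tries').
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"
    # Sibilant endings take -es ('watch' -> 'watches').
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    # Default: plain -s ('walk' -> 'walks').
    return stem + "s"


def build_word_list(stems):
    """Expand each stem into the set of forms the spell checker accepts."""
    words = set()
    for stem in stems:
        words.add(stem)
        words.add(s_form(stem))
    return words


# Hypothetical stem list, for illustration only.
WORDS = build_word_list(["try", "play", "watch", "walk"])
```

With this list, 'tries' is accepted and 'trys' is not. Every further affix
(-ed, -ing, ...) needs its own spelling rules, which is where the effort
goes even for a language as simple as English.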

Even if you can come up with the correct paradigm of every word, the result 
may be such a large list that it becomes infeasible to use it as a static 
list in a spell checker. I forget what the number of forms of each verb in 
Finnish is, but it's a very large number, and an exhaustive _list_ of every 
form of every verb would simply not fit in memory. (There are other 
representations besides lists, of course, that can handle such languages; 
finite state transducers are a typical solution.)
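
As a rough illustration of why a non-list representation helps, the sketch
below stores stems and endings separately and checks a word by stripping
endings, instead of enumerating every form. It is far simpler than a real
finite state transducer (it knows nothing about stem changes), and the data
are a toy sample I chose for illustration: two Finnish noun stems, a few
case endings, and one clitic.

```python
# Toy data, for illustration only.
STEMS = {"talo", "auto"}
CASE_ENDINGS = ("", "n", "ssa", "sta", "lla", "lta", "lle")
CLITICS = ("", "kin")


def accepts(word):
    """Return True if word is stem + case ending + optional clitic."""
    for clitic in CLITICS:
        if clitic and not word.endswith(clitic):
            continue
        rest = word[: len(word) - len(clitic)]
        for ending in CASE_ENDINGS:
            if ending and not rest.endswith(ending):
                continue
            stem = rest[: len(rest) - len(ending)]
            if stem in STEMS:
                return True
    return False


def listed_form_count():
    """Number of entries a flat list of the same forms would need."""
    return len(STEMS) * len(CASE_ENDINGS) * len(CLITICS)
```

Here 11 stored items stand in for 28 listed forms. The listed count grows
as the product of the slot sizes, while the stored data grows only as
their sum, which is the point of moving away from a static list.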

Another situation where simply listing the forms is cumbersome, if not 
outright impossible, is languages with free compounding or incorporation 
(where the components of the compound are written "solid", i.e. without 
a space, hyphen, or other separator). German is an example. IIRC, the 
Microsoft German spell checker has some sort of finite state solution for 
compounds.
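
One simple way to accept solid compounds without listing them is to test
whether a word can be split into known parts. The sketch below does that
in Python with a small dynamic program. It is not a description of what
the Microsoft checker does; the part list is made up, and the only linking
element handled is German '-s-'. A real checker would also need to
restrict which parts may combine, since this version overgenerates.

```python
def is_known_or_compound(word, parts, linkers=("s",)):
    """Return True if word is a known part or a chain of known parts.

    A linking element from `linkers` may follow any non-final part.
    """
    word = word.lower()
    n = len(word)
    # reachable[i] is True if word[:i] splits into known parts (+ linkers).
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        for j in range(i + 1, n + 1):
            if word[i:j] not in parts:
                continue
            reachable[j] = True
            # Allow a linking element, but only if more material follows.
            for link in linkers:
                if word.startswith(link, j) and j + len(link) < n:
                    reachable[j + len(link)] = True
    return reachable[n]


# Hypothetical part list, for illustration only.
PARTS = {"arbeit", "zimmer", "haus", "tür"}
```

With these parts, 'Haustür' and 'Arbeitszimmer' are accepted, while a
string like 'Hausx' is rejected.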

There are tools for generating paradigms (and compounds or incorporation); 
in trivial cases, one can do it with a language like perl or Python. But 
their use often requires significant effort by a linguist.

I believe these are also issues for tools like Google. That is, if you want 
to look for a German noun, you probably want to find it with any of its 
case endings. You can't do that unless you type in all the forms (with 'OR' 
between), or unless Google has implemented a stemmer for that language. I 
asked the President(?) of Google about this after a talk he gave one time, 
but didn't get a clear answer. My guess is that they do stemming for some 
'major' languages, but not for most languages.
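
The manual workaround, typing in all the forms with 'OR' between, is easy
to automate once you have the paradigm. A minimal Python sketch follows;
the function name is my own, and the forms of 'Haus' are listed by hand
here rather than generated.

```python
def or_query(forms):
    """Join the distinct forms of a word into a single 'OR' query string."""
    unique = []
    for form in forms:
        if form not in unique:  # keep first occurrence, preserve order
            unique.append(form)
    return " OR ".join(unique)


# Case and number forms of German 'Haus', listed by hand.
HAUS_FORMS = ["Haus", "Hauses", "Hause", "Haus", "Häuser", "Häusern"]
QUERY = or_query(HAUS_FORMS)
```

This only moves the problem, of course: someone still has to supply the
paradigm for each word.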

Then of course there are written languages whose spelling has not been 
standardized, such as Chechen. But then you don't need a spell checker :-).


