[Lexicog] Issues regarding a free dictionary
Mike Maxwell
maxwell at LDC.UPENN.EDU
Thu Dec 8 19:16:48 UTC 2005
On Dec 8 2005, Sabine Cretella wrote:
> Btw.: we are also foreseeing the possibility to create spellcheckers :-)
One thing that is easy to forget is the fact that spellcheckers based on
dictionaries only work if the language (1) writes words with some kind of
delineating mark (such as a space character on either side), and (2) has
trivial inflectional morphology. Requirement (1) makes it difficult, if not
impossible, to envision a Thai spell checker (there is no space between
words in written Thai), and requirement (2) makes it difficult to imagine a
dictionary-based spell checker for languages like Tagalog or even Spanish.
There are of course languages where (1) and (2) are not problems. But I
would guess that most languages have some problems in area (2).
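To make the two requirements concrete, here is a minimal sketch of a purely dictionary-based checker in Python. The function name `misspelled` and the toy lexicon are made up for illustration; the sketch assumes that words are separated by non-letter characters and that every inflected form is itself listed.

```python
import re

def misspelled(text, lexicon):
    """Return the tokens of `text` that are not listed in `lexicon`.

    Assumption (1): words are delimited by non-letter characters.
    Assumption (2): every inflected form is itself a lexicon entry.
    """
    # [^\W\d_]+ matches runs of letters (Unicode-aware in Python 3).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in lexicon]

# Hypothetical toy lexicon, for illustration only.
lexicon = {"the", "cat", "try", "tries", "to", "sleep"}
print(misspelled("The cat trys to sleep", lexicon))
```

With this toy lexicon a correct form such as 'cats' would also be flagged, simply because it is not listed, which is the problem in area (2). For a language written without spaces the tokenizing step itself would fail.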
In order to create spell checkers for languages with inflectional
morphology, you have to have a way of creating the entire paradigm of every
word. This is rarely trivial. Even for English, you have to cope with
spelling variations that arise in affixed words ('tries', not *'trys'). For
Spanish, you have to deal with much larger verbal paradigms, plus stem
allomorphy ('tengo' and 'tienes', not *'teno' and *'tenes'). For languages
like Tagalog, with infixing and reduplication, things are even worse.
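As a rough illustration of the English case, the following Python sketch generates the third person singular of a verb. The function name `third_singular` and the small exception table are assumptions made for this example, and the rules cover only the commonest spelling alternations, not the whole of English.

```python
VOWELS = "aeiou"

# Hypothetical, deliberately incomplete table of irregular forms.
IRREGULAR = {"be": "is", "have": "has", "go": "goes", "do": "does"}

def third_singular(verb):
    """Return the third person singular present form of an English verb.

    A sketch only: it handles listed irregulars, sibilant-final stems
    ('watch' -> 'watches') and consonant + y ('try' -> 'tries').
    """
    if verb in IRREGULAR:
        return IRREGULAR[verb]
    if verb.endswith(("s", "x", "z", "ch", "sh")):
        return verb + "es"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in VOWELS:
        return verb[:-1] + "ies"
    return verb + "s"

for v in ("try", "play", "watch", "go"):
    print(v, third_singular(v))
```

Even this one cell of the paradigm needs an exception list plus spelling rules. Stem allomorphy of the Spanish kind, or infixing and reduplication, would need considerably more machinery than string suffixing.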
Even if you can come up with the correct paradigm of every word, the result
may be such a large list that it becomes infeasible to use it as a static
list in a spell checker. I forget what the number of forms of each verb in
Finnish is, but it's a very large number, and an exhaustive _list_ of every
form of every verb would simply not fit in memory. (There are other
representations besides lists, of course, that can handle such languages;
finite state transducers are a typical solution.)
Another situation where simply listing the forms is cumbersome, if not
outright impossible, is languages with free compounding or incorporation
(and where the components of the compound are written "solid", i.e. without
a space, dash, etc.). German is an example. IIRC, the Microsoft German
spell checker has some sort of finite state solution for compounds.
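The compounding problem can be sketched in Python as well. This is not a finite state implementation and says nothing about how any commercial spell checker works; it is a plain dynamic-programming check of whether a solid-written word can be split entirely into known parts. The function name `is_compound`, the `min_len` parameter and the three-word lexicon are hypothetical, and linking elements between the parts (such as the German -s-) are ignored.

```python
def is_compound(word, lexicon, min_len=2):
    """True if `word` can be segmented into one or more lexicon entries.

    reachable[i] records whether word[:i] can be fully segmented.
    Assumption: parts are simply concatenated, with no linking elements.
    """
    n = len(word)
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is trivially segmentable
    for end in range(1, n + 1):
        for start in range(end):
            part = word[start:end]
            if reachable[start] and len(part) >= min_len and part in lexicon:
                reachable[end] = True
                break
    return reachable[n]

# Hypothetical toy lexicon of German nouns, lowercased.
lexicon = {"haus", "tür", "schlüssel"}
print(is_compound("haustürschlüssel", lexicon))
print(is_compound("haustor", lexicon))
```

The point of the sketch is that the compounds are recognized from their parts and never have to be listed. It will also overgenerate, accepting any concatenation of listed words whether or not it is a plausible compound.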
There are tools for generating paradigms (and compounds or incorporation);
in trivial cases, one can do it with a language like Perl or Python. But
their use often requires significant effort by a linguist.
I believe these are also issues for tools like Google. That is, if you want
to look for a German noun, you probably want to find it with any of its
case endings. You can't do that unless you type in all the forms (with 'OR'
between), or unless Google has implemented a stemmer for that language. I
asked the President(?) of Google about this after a talk he gave one time,
but didn't get a clear answer. My guess is that they do stemming for some
'major' languages, but not for most languages.
Then of course there are written languages whose spelling has not been
standardized, such as Chechen. But then you don't need a spell checker :-).