[Lexicog] Issues regarding a free dictionary

Andrew Dunbar hippytrail at GMAIL.COM
Thu Dec 8 20:23:21 UTC 2005


These problems have been solved before with varying degrees of success.

For Thai as well as Chinese and Japanese, there are various algorithms or
systems available to find breaks between words.
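
To make that concrete, here is a minimal sketch of one of the simplest such algorithms, greedy longest-match ("maximum matching") against a word list. It is only an illustration: the function name, the toy English lexicon, and the unspaced input are all assumptions of mine, and real Thai, Chinese, or Japanese segmenters are considerably more sophisticated.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation of text written without spaces."""
    max_len = max(map(len, lexicon))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit one character as an unknown token
        words.append(text[i:j])
        i = j
    return words


# Toy lexicon and input (hypothetical, for illustration only).
print(segment("thecatsat", {"the", "cat", "sat", "on", "mat"}))
# prints ['the', 'cat', 'sat']
```

Greedy matching is where the "varying degrees of success" shows up: if the lexicon also contains "cats" and "at", the same input comes out as "the", "cats", "at".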

For inflection, compounding, and other morphology that gives rise to many
correct forms, one approach is a "smart" spellchecker that knows about
paradigms and irregular forms and also contains a dictionary. Another
approach is to put the "smarts" into a program that builds a full dictionary,
including all inflections etc., from a basic dictionary.
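
A minimal sketch of the second approach, assuming a toy English verb list: a small rule function plus a table of irregular forms expands each base entry into the forms a plain word-list spellchecker would need. The names `third_person`, `IRREGULAR`, and `expand` are made up for this example, and the rules cover only the third person singular.

```python
# Irregular forms are listed explicitly and override the rules (toy data).
IRREGULAR = {"be": {"am", "is", "are"}, "have": {"has"}}


def third_person(verb):
    """Regular third person singular, with the usual spelling adjustments."""
    if len(verb) > 1 and verb.endswith("y") and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"          # try -> tries, not *trys
    if verb.endswith(("s", "x", "z", "ch", "sh")):
        return verb + "es"                # pass -> passes
    return verb + "s"                     # play -> plays


def expand(base_verbs):
    """Build the full-form word list from the base dictionary."""
    forms = set()
    for verb in base_verbs:
        forms.add(verb)
        if verb in IRREGULAR:
            forms |= IRREGULAR[verb]
        else:
            forms.add(third_person(verb))
    return forms


print(sorted(expand(["try", "play", "have"])))
```

The output of such a generator can then be fed to an ordinary dictionary-based checker, at the cost of a much larger word list.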

I have seen a very good spellchecker for Irish which used the latter method.
The Irish language has very many forms of each word due to assimilation
and lenition.
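
As a rough illustration of how such forms multiply, lenition in Irish spelling can be approximated by inserting "h" after a lenitable initial consonant. The sketch below is a simplification under my own assumptions (the `lenite` name, the consonant sets, and the sample words); it ignores the grammatical contexts that trigger lenition as well as the other initial mutations.

```python
LENITABLE = set("bcdfgmpst")
# Assumption: initial s is left alone before these letters (sc-, sm-, sp-, st- ...).
S_BLOCKERS = set("cfmptv")


def lenite(word):
    """Return the lenited spelling of word, or word unchanged if not lenitable."""
    first = word[:1].lower()
    second = word[1:2].lower()
    if first not in LENITABLE:
        return word                  # vowels and l, n, r show no change in spelling
    if first == "s" and second in S_BLOCKERS:
        return word
    if second == "h":
        return word                  # already lenited
    return word[0] + "h" + word[1:]


for base in ["bean", "cat", "scoil", "athair"]:
    print(base, "->", lenite(base))
```

A dictionary builder would add both the base and the lenited spelling of every affected headword, which is one reason the full-form list grows so quickly.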

On 08 Dec 2005 14:16:48 -0500, maxwell at ldc.upenn.edu
<maxwell at ldc.upenn.edu> wrote:
> On Dec 8 2005, Sabine Cretella wrote:
> > Btw.: we are also foreseeing the possibility to create spellcheckers :-)
>
> One thing that is easy to forget is the fact that spellcheckers based on
> dictionaries only work if the language (1) writes words with some kind of
> delineating mark (such as a space character on either side), and (2) has
> trivial inflectional morphology. Requirement (1) makes it difficult, if not
> impossible, to envision a Thai spell checker (there is no space between
> words in written Thai), and requirement (2) makes it difficult to imagine a
> dictionary-based spell checker for languages like Tagalog or even Spanish.
>
> There are of course languages where (1) and (2) are not problems. But I
> would guess that most languages have some problems in area (2).
>
> In order to create spell checkers for languages with inflectional
> morphology, you have to have a way of creating the entire paradigm of every
> word. This is rarely trivial. Even for English, you have to cope with
> spelling variations that arise in affixed words ('tries', not *'trys'). For
> Spanish, you have to deal with much larger verbal paradigms, plus stem
> allomorphy ('tengo' and 'tienes', not *'teno' and *'tenes'). For languages
> like Tagalog, with infixing and reduplication, things are even worse.
>
> Even if you can come up with the correct paradigm of every word, the result
> may be such a large list that it becomes infeasible to use it as a static
> list in a spell checker. I forget what the number of forms of each verb in
> Finnish is, but it's a very large number, and an exhaustive _list_ of every
> form of every verb would simply not fit in memory. (There are other
> representations besides lists, of course, that can handle such languages;
> finite state transducers are a typical solution.)
>
> Another situation where simply listing the forms is cumbersome, if not
> outright impossible, is languages with free compounding or incorporation
> (and where the components of the compound are written "solid", i.e. without
> a space, dash, etc.). German is an example. IIRC, the Microsoft German
> spell checker has some sort of finite state solution for compounds.
>
> There are tools for generating paradigms (and compounds or incorporation);
> in trivial cases, one can do it with a language like perl or Python. But
> their use often requires significant effort by a linguist.
>
> I believe these are also issues for tools like Google. That is, if you want
> to look for a German noun, you probably want to find it with any of its
> case endings. You can't do that unless you type in all the forms (with 'OR'
> between), or unless Google has implemented a stemmer for that language. I
> asked the President(?) of Google about this after a talk he gave one time,
> but didn't get a clear answer. My guess is that they do stemming for some
> 'major' languages, but not for most languages.
>
> Then of course there are written languages whose spelling has not been
> standardized, such as Chechen. But then you don't need a spell checker :-).
>


--
http://linguaphile.sf.net

