[Lexicog] Dictionary software

Bill Poser billposer2 at GMAIL.COM
Tue Apr 29 01:30:04 UTC 2014


It sounds like Wiktionary has a morphological generator, though I have to
say I'm surprised. Is it really capable of handling complex morphology?


On Mon, Apr 28, 2014 at 6:13 PM, Benjamin Barrett <
benjaminbarrett85 at gmail.com> wrote:

>
>
> I'm not sure about the parser/generator part. As I said, Wiktionary allows
> you to write the rules so that when a verb or other POS is entered (with or
> without irregular forms), a page for each form is generated so the
> dictionary user can look up any form. That, of course, includes
> reduplicated forms as well. You can see this by entering forms like eaten,
> vado, 行かない, etc., at https://en.wiktionary.org/wiki.
>
> Unlike situations where a print dictionary is the object, I don't see
> variations in lexical categories as too critical for the Lushootseed
> project. The purpose of an online dictionary is widespread, easy access,
> and while consistency is of course desirable, access to good information is
> more important. 15 categories expanding to 118 is extreme, though; by
> monitoring new entries, we can hopefully nip issues like that in the bud by
> contacting editors and making changes to the entry templates.
>
> As for inconsistencies in entries, by creating templates, those can be
> reduced, but the open format of Wiktionary is definitely a drawback in that
> respect. Again, though, I don't view missing fields or inconsistency in
> field order as primary in importance for this online project.
>
> What I imagine is people learning their heritage language sitting at home,
> wondering how to say "travel by land," pulling out their smartphones to
> get the word, and hopefully memorizing the simple sample sentence provided
> while they're looking at the page.
>
> Ben Barrett
> La Conner, WA
>
> Learn Ainu! https://sites.google.com/site/aynuitak1/videos
>
> On Apr 28, 2014, at 9:20 AM, Bill Poser <billposer2 at gmail.com> wrote:
>
>
>
> As a bit of data in support of Mike's point that it is desirable to
> validate manually created databases, when I wrote the code to produce print
> dictionaries from Jonathan Amith's Oapan and Ameyaltepec Nahuatl database,
> which was in something like the SIL SDF format but not created using
> Shoebox or Toolbox, I initially found something like 118 lexical
> categories. This was due to variations in capitalization, choice of
> abbreviation, and use of both English and Spanish. We ended up with 15
> after merging all the variants that had crept in.
>
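As a rough sketch of the kind of cleanup described above (not the code
actually used for the Nahuatl dictionaries), category labels can be folded
to a canonical form and anything unrecognized flagged for review. The
CANONICAL table and the sample labels in it are hypothetical; a real table
would be built by hand from the labels found in the database.

```python
from collections import Counter

# Hypothetical variant -> canonical label table (English and Spanish
# abbreviations mixed, as in the situation described above).
CANONICAL = {
    "n": "noun", "noun": "noun", "sust": "noun", "sustantivo": "noun",
    "v": "verb", "verb": "verb", "verbo": "verb",
}

def normalize(label):
    """Fold case, surrounding whitespace and a trailing period, then look up."""
    key = label.strip().lower().rstrip(".")
    return CANONICAL.get(key)  # None if the label is not recognized

def audit(labels):
    """Count labels per canonical category; collect unrecognized ones."""
    merged, unknown = Counter(), Counter()
    for label in labels:
        canon = normalize(label)
        if canon is None:
            unknown[label] += 1
        else:
            merged[canon] += 1
    return merged, unknown
```

Running audit() over every category field in the database and reading the
"unknown" counter is one way to see how many distinct labels have crept in
before deciding how to merge them.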
>
> On Mon, Apr 28, 2014 at 9:09 AM, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
>
>>
>>
>> On 4/28/2014 1:10 AM, Benjamin Barrett wrote:
>> > For Lushootseed, I think we calculated that with various prefixes, there
>> > should be fewer than 120 forms (which is about the Latin count, I think),
>> > which is a reasonable count. It's nice to have one page for every form
>> > so people can look up whatever form they have at hand, but if you have a
>> > language with hundreds of forms per verb, then you might have to
>> > consider whether you want to pare it down to keep your database small
>> > (though obviously Wikipedia and Wiktionary are huge).
>>
>> With a count like that, you probably want a morphological parser/
>> generator to create the forms (otherwise you inflate the number of verb
>> forms that you need to enter by two orders of magnitude). FLEx has such a
>> parser built in.
>>
>> A finite-state transducer (built with xfst/lexc, foma, or SFST) allows
>> both parsing and generation from the same rule set. If you can express
>> the rules (the morphotactics, plus the phonological rules that create
>> allomorphs) in the xfst or SFST formalism, and export the lexical
>> entries from your dictionary, then building one is not too hard. With an
>> appropriate interface to your web page, you can automatically call the
>> parser on forms the user types in. Dunno if you can do that with
>> Wiktionary.
>>
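To illustrate the "same rule set, both directions" idea in the quoted
paragraph, here is a toy in plain Python. It is not xfst, foma, or SFST and
is not a real finite-state transducer: it has no phonological rules and
handles only suffixes. The LEXICON and SUFFIXES tables are invented English
examples, not Lushootseed data.

```python
# One table of (tag -> suffix) pairs is used for generation and for parsing.
LEXICON = {"walk", "jump"}
SUFFIXES = {"+Inf": "", "+3sg": "s", "+Past": "ed", "+Prog": "ing"}

def generate(stem, tag):
    """Lexical side -> surface form, e.g. ('walk', '+Past') -> 'walked'."""
    if stem not in LEXICON:
        raise KeyError(stem)
    return stem + SUFFIXES[tag]

def parse(form):
    """Surface form -> every (stem, tag) analysis the same table licenses."""
    analyses = []
    for tag, suffix in SUFFIXES.items():
        stem = form[: len(form) - len(suffix)]
        if form.endswith(suffix) and stem in LEXICON:
            analyses.append((stem, tag))
    return analyses
```

A form typed in by a dictionary user would go through parse(); the pages
for each inflected form would come from generate() over the lexicon.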
>> IIRC, Lushootseed has reduplication, although perhaps you've accounted
>> for that by listing the various reduplicated forms.
>>
>> FWIW, I would suggest creating some kind of test program to ferret out
>> broken lexical entries. With free-form entry like Wiktionary (or
>> Toolbox), erroneous entries (entries with missing fields, fields in the
>> wrong order, etc.) are bound to arise.
>>
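A minimal sketch of such a test program, assuming entries are stored as
backslash-marked fields, one per line. The marker names (\lx, \ps, \ge) and
the required order are assumptions for illustration; a real checker would
use whatever fields the project's entry template defines.

```python
# Assumed layout: \lx (headword), \ps (part of speech), \ge (gloss),
# required in that order.
REQUIRED_ORDER = ["lx", "ps", "ge"]

def parse_record(text):
    """Split one entry into (marker, value) pairs, ignoring other lines."""
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            fields.append((marker, value.strip()))
    return fields

def check_record(fields):
    """Return a list of problems: missing/empty fields, wrong field order."""
    problems = []
    present = [m for m, v in fields if m in REQUIRED_ORDER and v]
    for marker in REQUIRED_ORDER:
        if marker not in present:
            problems.append("missing or empty \\" + marker)
    first_seen = list(dict.fromkeys(present))  # de-duplicate, keep order
    expected = [m for m in REQUIRED_ORDER if m in first_seen]
    if first_seen != expected:
        problems.append("fields out of order")
    return problems
```

Running check_record(parse_record(entry)) over every entry and printing the
non-empty results gives a list of entries to fix by hand.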
>>
>
>