[Corpora-List] Word family or stemming validation data

Jeff Elmore jelmore at lexile.com
Fri Feb 7 04:33:18 UTC 2014


I see now that I can get such data through the MorphoChallenge site.
http://research.ics.aalto.fi/events/morphochallenge2010/datasets.shtml

Apologies.

Are there any other sources I might consider?

Thanks again.


On Thu, Feb 6, 2014 at 10:55 PM, Jeff Elmore <jelmore at lexile.com> wrote:

> Hello friends,
>
> I am working on a project calculating word frequencies in English basal
> reading programs.
>
> We want to report individual word form frequencies as well as several
> levels of word family frequencies. We would also like to differentiate
> between inflected and derived forms. I've considered Nation's 7 levels
> of word relations, but we only need three or maybe four levels: root
> word, inflected, common derivations, all derivations.
>
> I've explored the use of all of the stemmers available in NLTK and an
> approach using Morfessor.
>
> So far I'm not finding one obvious winner, as each of them seems to have
> its own problems.
>
> What I'd love is a set of word forms with associated stems or roots and
> (optionally) whether each form is inflected or derived, to test the
> accuracy of the various approaches. In addition, I'd like a set of word
> families with all the appropriate inflected and derived forms.
>
> I'm prepared to collect at least a small sample of this data myself, but
> am curious if a resource exists already.
>
> Thanks!
>
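
The accuracy test described in the quoted message could be sketched roughly as follows, assuming the validation data can be reduced to (form, stem, relation) triples. Everything named here is hypothetical and for illustration only: the GoldEntry layout, the score function, the toy_stemmer stand-in, and the handful of hand-written gold entries, which do not come from any real resource. Any callable that maps a word to a stem can be passed in place of toy_stemmer, for example the stem method of NLTK's PorterStemmer.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class GoldEntry(NamedTuple):
    """One validation record (hypothetical layout, not a real dataset format)."""
    form: str      # surface word form
    stem: str      # hand-assigned stem or root
    relation: str  # "root", "inflected", or "derived"


def toy_stemmer(word: str) -> str:
    """Stand-in stemmer for illustration: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def score(stem_fn: Callable[[str], str],
          gold: Iterable[GoldEntry]) -> Dict[str, float]:
    """Return exact-match accuracy of stem_fn, broken down by relation type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for entry in gold:
        totals[entry.relation] += 1
        if stem_fn(entry.form) == entry.stem:
            hits[entry.relation] += 1
    return {rel: hits[rel] / totals[rel] for rel in totals}


# Toy gold data written by hand for this sketch only.
GOLD = [
    GoldEntry("walk", "walk", "root"),
    GoldEntry("walks", "walk", "inflected"),
    GoldEntry("walked", "walk", "inflected"),
    GoldEntry("walking", "walk", "inflected"),
    GoldEntry("walker", "walk", "derived"),
    GoldEntry("happiness", "happy", "derived"),
]

if __name__ == "__main__":
    for relation, accuracy in score(toy_stemmer, GOLD).items():
        print(f"{relation}: {accuracy:.2f}")
```

Reporting accuracy per relation type rather than as one overall figure keeps inflectional and derivational errors apart, which is the distinction the project needs. Exact string match is a strict criterion: many stemmers return truncated strings rather than dictionary roots, so the gold stems would need to follow the same convention, or the comparison could instead check whether two forms land in the same group.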