[Corpora-List] Word family or stemming validation data

Fri Feb 7 03:55:47 UTC 2014

Hello friends,

I am working on a project calculating word frequencies in English basal
reading programs.

We want to report individual word form frequencies as well as
several levels of word family frequencies. We would also like to be able to
differentiate between inflected and derived forms. I've considered Nation's
seven levels of word relations, but we only need three or maybe four levels:
root word, inflected forms, common derivations, and all derivations.
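
As a rough illustration of the reporting levels described above, here is a
minimal Python sketch that rolls word form counts up into family counts at a
chosen level. The FAMILY lookup, the LEVELS names, and the function
family_frequencies are all hypothetical and hand-made for this example; the
"common derivations" level is left out because it would need a list of which
derivations count as common.

```python
from collections import Counter

# Hypothetical hand-made lookup: word form -> (root, relation to the root).
# A real table would come from a validated resource or manual annotation.
FAMILY = {
    "read": ("read", "root"),
    "reads": ("read", "inflected"),
    "reading": ("read", "inflected"),
    "reader": ("read", "derived"),
    "readable": ("read", "derived"),
}

# Which relations are folded into the root at each reporting level
# (an assumption made for this sketch).
LEVELS = {
    "root": {"root"},
    "inflected": {"root", "inflected"},
    "all_derivations": {"root", "inflected", "derived"},
}


def family_frequencies(tokens, level):
    """Count tokens, merging a form into its root when the level allows it."""
    allowed = LEVELS[level]
    counts = Counter()
    for token in tokens:
        form = token.lower()
        # Forms missing from the lookup are treated as their own root.
        root, relation = FAMILY.get(form, (form, "root"))
        if relation in allowed:
            counts[root] += 1
        else:
            counts[form] += 1  # keep the form as a separate entry
    return counts


tokens = ["read", "reads", "Reading", "reader", "readers"]
for level in LEVELS:
    print(level, dict(family_frequencies(tokens, level)))
```

Note that "readers" is not in the toy lookup, so it stays a separate entry at
every level; gaps like this are exactly what a validated resource would fill.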

I've tried all of the stemmers available in NLTK, as well as an
approach using Morfessor.

So far there is no obvious winner, as each of them seems to have its own
problems.

What I'd love is a set of word forms with their associated stems or roots and
(optionally) a label saying whether each form is inflected or derived, so that
I can test the accuracy of the various approaches. In addition, I'd like a set
of word families listing all of the appropriate inflected and derived forms.
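
To make the kind of test described above concrete, here is a minimal Python
sketch of an accuracy check against gold (form, stem) pairs. The GOLD list is
a tiny invented sample, and evaluate and strip_s are hypothetical names for
this example; strip_s is only a toy baseline standing in for a real stemmer.
Any callable that maps a word to a stem could be passed instead, for example
the stem method of an NLTK PorterStemmer instance.

```python
def evaluate(stemmer, gold):
    """Return (accuracy, errors) for a stemmer callable on (form, stem) pairs."""
    errors = []
    for form, expected in gold:
        predicted = stemmer(form)
        if predicted != expected:
            errors.append((form, expected, predicted))
    accuracy = 1 - len(errors) / len(gold) if gold else 0.0
    return accuracy, errors


def strip_s(word):
    """Toy baseline: drop a final 's' from words longer than three letters."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Invented sample for illustration only, not a real validation set.
GOLD = [
    ("cats", "cat"),
    ("running", "run"),
    ("bus", "bus"),
    ("glasses", "glass"),
]

accuracy, errors = evaluate(strip_s, GOLD)
print(f"accuracy: {accuracy:.2f}")
for form, expected, predicted in errors:
    print(f"{form}: expected {expected}, got {predicted}")
```

Keeping the list of errors alongside the accuracy figure makes it easier to
see whether a given approach fails mostly on inflected or on derived forms.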

I'm prepared to collect at least a small sample of this data myself, but I'm
curious whether such a resource already exists.

Thanks!