<div dir="ltr">Hello friends,<div><br></div><div>I am working on a project that calculates word frequencies in English basal reading programs.</div><div><br></div><div>We would like to report frequencies for individual word forms as well as for word families at several levels, and to distinguish inflected forms from derived forms. I've considered Nation's seven levels of word relations, but we need only three or perhaps four: root word, inflected forms, common derivations, and all derivations.</div>
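<div><br></div><div>To make the levels concrete, here is a rough sketch of the kind of roll-up I have in mind. The <code>FAMILY</code> lookup and the function name are made up for illustration; in practice the mapping would come from whatever stemmer or word-family resource we end up using.</div>

```python
from collections import Counter

# Hypothetical hand-built lookup: word form -> (root, relation to root).
FAMILY = {
    "help": ("help", "root"),
    "helps": ("help", "inflected"),
    "helped": ("help", "inflected"),
    "helper": ("help", "derived"),
    "helpless": ("help", "derived"),
}


def family_frequencies(tokens, include=("root", "inflected")):
    """Count word forms, then roll counts up to roots for the chosen relations."""
    form_counts = Counter(t.lower() for t in tokens)
    family_counts = Counter()
    for form, n in form_counts.items():
        # Unknown forms are treated as their own root (an assumption).
        root, relation = FAMILY.get(form, (form, "root"))
        if relation in include:
            family_counts[root] += n
        else:
            # Forms outside the chosen level are counted on their own.
            family_counts[form] += n
    return form_counts, family_counts


tokens = "Help helps helped helper helpless help".split()
forms, inflected_level = family_frequencies(tokens)
_, all_level = family_frequencies(
    tokens, include=("root", "inflected", "derived")
)
```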
<div><br></div><div>I've tried all of the stemmers available in NLTK, as well as an approach using Morfessor.</div><div><br></div><div>So far there is no obvious winner; each of them has its own problems.</div>
<div><br></div><div>What I'd love is a set of word forms with their associated stems or roots and (optionally) a label saying whether each form is inflected or derived, so that I can test the accuracy of the various approaches. I'd also like a set of word families listing all the appropriate inflected and derived forms.</div>
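<div><br></div><div>For what it's worth, this is roughly how I'd use such a gold set. It is only a sketch: the <code>GOLD</code> entries are invented examples, and <code>naive_stem</code> is a toy stand-in for a real stemmer (any function from string to string could be passed in, e.g. the <code>stem</code> method of NLTK's <code>PorterStemmer</code>). Exact matching against the root is a simplifying assumption, since some stemmers return stems that are not real words.</div>

```python
def naive_stem(word):
    """Toy stand-in stemmer: strips the first matching suffix, if any."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Hypothetical gold entries: (word form, expected root, relation).
GOLD = [
    ("walks", "walk", "inflected"),
    ("walked", "walk", "inflected"),
    ("walking", "walk", "inflected"),
    ("walker", "walk", "derived"),
    ("running", "run", "inflected"),
]


def accuracy_by_relation(stem, gold):
    """Return the share of forms whose stem equals the gold root, per relation."""
    hits, totals = {}, {}
    for form, root, relation in gold:
        totals[relation] = totals.get(relation, 0) + 1
        if stem(form) == root:
            hits[relation] = hits.get(relation, 0) + 1
    return {rel: hits.get(rel, 0) / totals[rel] for rel in totals}


scores = accuracy_by_relation(naive_stem, GOLD)
```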
<div><br></div><div>I'm prepared to collect at least a small sample of this data myself, but I'm curious whether such a resource already exists.</div><div><br></div><div>Thanks!</div></div>