[Corpora-List] threshold for hapax legomena

Ed Kenschaft ekenschaft at gmail.com
Fri May 25 14:58:14 UTC 2007


On 5/25/07, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
> (Thresholds are appealing as they reduce the computational scale of the
> task, so lots of people have tried to use them to make tasks more
> manageable: however performance has usually suffered.)


This has been my consistent anecdotal experience in NLP tasks: dropping rare
items can dramatically reduce the space and time required, but at a cost in
accuracy. You'll need to decide empirically what trade-off is worth it for
you. There are many applications where dropping true hapax legomena
(singletons) cuts your complexity by 50% with a negligible cost to
performance, but you can't count on it until you try it.
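
To make "try it" concrete, here is a minimal sketch in Python of the first
measurement one might take: how much the vocabulary shrinks, and how much of
the running text is still covered, once singletons are dropped. The function
name drop_hapaxes, the min_count parameter, the whitespace tokenisation and
the toy sentence are all assumptions for illustration, not anything from a
particular toolkit. It says nothing about task accuracy, which still has to
be measured separately on your own application.

```python
from collections import Counter


def drop_hapaxes(tokens, min_count=2):
    """Count word types and keep those seen at least min_count times.

    Hypothetical helper: with min_count=2 it drops true hapax legomena.
    Returns (kept_types, counts).
    """
    counts = Counter(tokens)
    kept = {word for word, n in counts.items() if n >= min_count}
    return kept, counts


# Toy corpus, purely illustrative; a real corpus would be far larger.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()  # naive whitespace tokenisation (an assumption)

kept, counts = drop_hapaxes(tokens)

# Fraction of word types removed by the threshold.
vocab_reduction = 1 - len(kept) / len(counts)
# Fraction of running tokens still covered by the kept types.
token_coverage = sum(counts[w] for w in kept) / len(tokens)

print(f"types: {len(counts)} -> {len(kept)} "
      f"(reduction {vocab_reduction:.1%}), "
      f"token coverage {token_coverage:.1%}")
```

Raising min_count and re-running the downstream task at each setting is one
way to chart the size/accuracy trade-off empirically.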

Another measure that I've found helpful for some applications, related to
IDF, is the S-score (Babych & Hartley 2004
<http://www.citeulike.org/user/ekenschaft/article/599582>; 2003).
Conceptually the opposite of filtering out hapax legomena, it identifies
function words that are too prevalent to be interesting.

Cheers.

-Ed

-- 
Ed Kenschaft
http://www.kenschaft.org