On 5/25/07, <b class="gmail_sendername">Adam Kilgarriff</b> &lt;<a href="mailto:adam@lexmasterclass.com">adam@lexmasterclass.com</a>&gt; wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

(Thresholds are appealing as they reduce the computational scale of the task, so lots of people have tried to use them to make tasks more manageable: however performance has usually suffered.)</blockquote><div><br>This has been my consistent anecdotal experience in NLP tasks: dropping rare items can dramatically reduce the size and time required, but at a cost in accuracy. You'll need to decide empirically which trade-off is worth it for you. There are many applications where dropping true hapax legomena (singletons) cuts your complexity by 50% with a negligible cost to performance, but you can't count on it until you try it.
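<br><br>A minimal sketch of such a frequency threshold, in Python. The names <code>drop_rare</code> and <code>min_count</code> and the toy sentence are illustrative assumptions, not taken from any particular toolkit; the point is only to show how much of the vocabulary a singleton cutoff can remove.

```python
from collections import Counter


def drop_rare(tokens, min_count=2):
    """Return the tokens that occur at least min_count times, plus the raw counts.

    With min_count=2 this removes exactly the hapax legomena (singletons).
    """
    counts = Counter(tokens)
    kept = [t for t in tokens if counts[t] >= min_count]
    return kept, counts


# Toy data, purely for illustration.
tokens = "the cat sat on the mat while the dog sat by the door".split()
kept, counts = drop_rare(tokens)

vocab_before = len(counts)     # distinct word types before filtering
vocab_after = len(set(kept))   # distinct word types after filtering
print(vocab_before, vocab_after)
```

On real corpora the reduction is far less extreme than on this toy sentence, and whether the smaller vocabulary hurts accuracy still has to be measured on the actual task.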

<br><br>Another measure I've found helpful for some applications is the S-score (<a href="http://www.citeulike.org/user/ekenschaft/article/599582">Babych &amp; Hartley 2004</a>, 2003), which is related to IDF. Conceptually it targets the opposite end of the frequency range from hapax legomena: it identifies function words that are too prevalent to be interesting.

<br><br>Cheers.<br><br>-Ed<br clear="all"></div></div><br>-- <br>Ed Kenschaft<br><a href="http://www.kenschaft.org">http://www.kenschaft.org</a><br>