[Corpora-List] stopword lists for norwegian and danish

Eric Ringger ringger at cs.byu.edu
Sat Feb 16 00:24:57 UTC 2008


Nice perspective.  Thanks, Trevor.

In text classification practice, we prefer feature selection, or even
distributional word clustering, over stop lists as a way of managing feature
vector sizes, if necessary.
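
To illustrate what feature selection can look like in place of a stop list,
here is a minimal sketch that scores terms with a chi-square statistic and
keeps the top k. The toy documents, labels, and function names are all
assumptions made up for the illustration, not taken from any particular
system.

```python
from collections import Counter

# Toy corpus and labels, invented for this illustration only.
DOCS = [
    "the ship sank in the storm",
    "the bank raised the rate",
    "the ship left the port",
    "the bank cut the rate",
]
LABELS = ["marine", "finance", "marine", "finance"]


def chi_square_scores(docs, labels, positive):
    """Chi-square score of each term's presence against the positive class."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    df = Counter()      # number of documents containing the term
    df_pos = Counter()  # number of positive documents containing the term
    for text, y in zip(docs, labels):
        terms = set(text.lower().split())
        df.update(terms)
        if y == positive:
            df_pos.update(terms)
    scores = {}
    for term, n_t in df.items():
        a = df_pos[term]       # term present, positive class
        b = n_t - a            # term present, other class
        c = n_pos - a          # term absent, positive class
        d = n - n_pos - b      # term absent, other class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term found in every document carries no class signal: score 0.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, positive, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels, positive)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


print(select_features(DOCS, LABELS, "marine", 3))
```

On this toy data "the" occurs in every document, so it scores zero and drops
out of the selected features without ever appearing on a stop list.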

Regards,
--Eric

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Trevor Jenkins
Sent: Thursday, February 14, 2008 11:34 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] stopword lists for norwegian and danish

On Mon, 11 Feb 2008, Roxana Angheluta <roxana at attentio.com> wrote:

> I am looking for stopword lists for Norwegian and Danish.

Sorry can't help with lists. But I'd like to swap hats from corpora to
computing science and enquire why you're looking for such lists.

Back when I worked in the R&D of a major text retrieval system we
deliberately did not support stop-lists; they increased the code
complexity, the original intent behind stop-lists (of reducing the size of
inverted index files) was no longer relevant with large discs and
compressed indices, and more importantly very few end-users understood
their purpose. During my 15 years with the company we only encountered one
real requirement for imposing stop lists, which was to obfuscate
controversial word usage by a British PM.

There are some anecdotal examples of English phrases where stop lists
should not be applied: "Lloyds of London" (the insurance market), "Prince
of Wales". The Lloyds example is particularly troublesome because Lloyds
of London is situated in the City of London not far from the headquarters
of Lloyds Bank, which is a separate institution, and around the corner
from Lloyds the chemist. Removing "of" would leave the words Lloyds and
London adjacent in all three cases, creating an ambiguity that cannot be
resolved easily, if at all.
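
The adjacency problem can be sketched in a few lines. The stop list, the two
sample sentences, and the function names below are hypothetical, chosen only
to show how a phrase match changes once stop words are stripped from both the
text and the query.

```python
import re

# Hypothetical stop list, for illustration only.
STOPWORDS = frozenset({"of", "in", "the"})

# Invented sample sentences.
INSURER = "Lloyds of London insured the vessel"
BANK = "She opened an account at Lloyds in London"


def tokens(text, stopwords=frozenset()):
    """Lowercase word tokens, minus any stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]


def contains_phrase(text, phrase, stopwords=frozenset()):
    """True if the phrase's tokens occur as an adjacent run in the text."""
    doc = tokens(text, stopwords)
    query = tokens(phrase, stopwords)
    if not query:
        return False
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


for label, text in (("insurer", INSURER), ("bank", BANK)):
    plain = contains_phrase(text, "Lloyds of London")
    stopped = contains_phrase(text, "Lloyds of London", STOPWORDS)
    print(label, "without stop list:", plain, "| with stop list:", stopped)
```

Without the stop list only the first sentence matches the phrase. With it,
both sentences reduce to the adjacent pair "lloyds london" and the query can
no longer tell them apart.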

YMMV.

Regards, Trevor

<>< Re: deemed!



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora