[Corpora-List] stopword lists for norwegian and danish

Trevor Jenkins trevor.jenkins at suneidesis.com
Thu Feb 14 18:34:28 UTC 2008


On Mon, 11 Feb 2008, Roxana Angheluta <roxana at attentio.com> wrote:

> I am looking for stopword lists for Norwegian and Danish.

Sorry, I can't help with lists. But I'd like to swap hats from corpora to
computing science and enquire why you're looking for such lists.

Back when I worked in the R&D of a major text retrieval system we
deliberately did not support stop lists, for three reasons: they increased
code complexity; their original purpose (reducing the size of inverted
index files) was no longer relevant with large discs and compressed
indices; and, more importantly, very few end-users understood what they
were for. During my 15 years with the company we encountered only one
real requirement for imposing stop lists, which was to obfuscate
controversial word usage by a British PM.

There are some anecdotal examples of English phrases where stop lists
should not be applied: "Lloyds of London" (the insurance market), "Prince
of Wales". The Lloyds example is particularly troublesome because Lloyds
of London is situated in the City of London not far from the headquarters
of Lloyds Bank, which is a separate institution, and around the corner
from Lloyds the chemist. Removing "of" would reduce the phrase to the
adjacent words "Lloyds London", which could plausibly match text about any
of the three, an ambiguity that cannot be resolved easily, if at all.
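
To make the Lloyds problem concrete, here is a minimal sketch in Python,
assuming a naive whitespace tokenizer and a two-word stop list. The
documents, the function names and the stop list are all invented for
illustration and are not taken from any particular retrieval system.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = frozenset({"of", "the"})


def tokens(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


def phrase_match(query, document, stop_words=frozenset()):
    """True if the query tokens occur as an adjacent run in the document."""
    q = tokens(query, stop_words)
    d = tokens(document, stop_words)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Invented sample documents: the first names the insurance market, the
# second merely places the two words side by side.
docs = [
    "Lloyds of London insures ships",
    "Lloyds London branch opens today",
]
query = "Lloyds of London"

# Phrase search with every word kept, then with stop words removed.
without_stoplist = [phrase_match(query, d) for d in docs]
with_stoplist = [phrase_match(query, d, STOP_WORDS) for d in docs]

print(without_stoplist)
print(with_stoplist)
```

With every word kept, only the first document matches the phrase. Once
"of" is stripped from both query and documents, the second matches as
well, and nothing left in the tokens can tell the two apart.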

YMMV.

Regards, Trevor

<>< Re: deemed!


