[Corpora-List] list of chinese stop words

Shachar Mirkin mirkins at macs.biu.ac.il
Thu Aug 16 10:27:11 UTC 2007


Well, things aren't that simple with Chinese. What counts as a word in a
Chinese text is not always clear, and any character can be used either
alone or as part of various words. To remove stop words, you probably need
to analyze the text more thoroughly, for example by segmenting it into
words. If you'd rather not apply such processing, one possible direction is
to rank the characters by frequency. Here's an example of a list of Chinese
characters ordered by frequency (it uses Traditional characters, but lists
for Simplified characters can also be found):

http://zhongwen.com/x/tsai1.htm
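
If you would rather derive such a ranking from your own pages than rely on
a published list, a rough sketch in Python could look like the following.
The sample string and the cutoff of two characters are made up for
illustration only, and the range check covers just the basic CJK Unified
Ideographs block, so treat it as a starting point rather than a finished
tool.

```python
from collections import Counter


def char_frequencies(text):
    """Return (character, count) pairs for CJK ideographs, most frequent first."""
    # Keep only characters in the basic CJK Unified Ideographs block;
    # punctuation, Latin letters and digits are ignored.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    return counts.most_common()


# Hypothetical sample text; in practice this would be the cleaned page content.
sample = "我的書是中國的書"
ranked = char_frequencies(sample)

# The most frequent characters are only *candidates* for a stop list and
# still need to be reviewed by hand.
candidates = [ch for ch, _ in ranked[:2]]
```

On a real corpus the cutoff would have to be tuned, and the top of the
ranking checked manually before anything is removed.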

The first characters on the list are mostly used as function words, but
that's a weak rule: for example, the last character on the first line is
the first of the two characters used to write "China", so you clearly don't
want to throw this one out.

More details or an example of a page would help.

Shachar


-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Alberto Lavelli
Sent: Thursday, August 16, 2007 11:36 AM
To: corpora at uib.no
Subject: [Corpora-List] list of chinese stop words

I am writing on behalf of a colleague of mine who works on removing
unwanted material from web pages (what is done for example in
Cleaneval, http://cleaneval.sigwac.org.uk).  
He's trying to find a freely available list of Chinese stopwords.  
Any pointers?

thanks in advance

	alberto

------------------
ITC -> dall'1 marzo 2007 Fondazione Bruno Kessler
ITC -> since 1 March 2007 Fondazione Bruno Kessler
------------------

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora




