[Corpora-List] Re. Concordancer for Chinese (Summary of reply)

Mon Oct 7 12:38:50 UTC 2002

The CKIP corpus, which is segmented, comes with its own concordancer, which searches for words rather than raw character strings.

There is a corpus of PRC XinHua newswires, called PH, which was built in Singapore and, as far as I remember, is available on a CD published by the University of Edinburgh. In this corpus, words are delimited by a slash, or something of the sort, so you could use the kind of regexp function suggested by others with it, and probably with other segmented corpora too. You would then be looking for a pattern bounded by slashes instead of white space.
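
As a rough illustration of the slash-bounded search described above, here is a minimal Python sketch. It assumes a format in which every word is followed by a "/" delimiter; the actual PH markup may well differ. The function name find_word, the sample string, and the characters used to write out the example words are my own illustrative choices and do not come from any corpus tool.

```python
import re

def find_word(segmented_text, word, delimiter="/"):
    """Return the start offsets of `word` where it is a whole delimited token.

    Assumption: tokens are separated (or terminated) by `delimiter`.
    """
    d = re.escape(delimiter)
    w = re.escape(word)
    # The word must be preceded by a delimiter or the start of the text,
    # and followed by a delimiter or the end of the text.
    pattern = re.compile(rf"(?:(?<={d})|^){w}(?={d}|$)")
    return [m.start() for m in pattern.finditer(segmented_text)]

# Hypothetical sample line: ying-guo / wen-hua / he / guo-wen /
sample = "英國/文化/和/國文/"
print(find_word(sample, "國文"))  # [8]
print(find_word(sample, "國"))    # [] because "國" never stands alone as a token here
```

The same pattern should carry over to other segmented corpora by changing the delimiter argument, provided the delimiter is a fixed string.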

If you were using a non-segmented corpus, you might be tempted to just search for strings of characters which you know are words. But this could be dangerous: if you were looking for the word "guo-wen" (national literature/Chinese literature as a school subject in Taiwan), you would also get a spurious hit in "ying-guo wen-hua" (British culture), where "guo" ends one word and "wen" begins the next!
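
The false hit is easy to reproduce. The sketch below is only an illustration: the function name naive_hits is hypothetical, the characters are my rendering of the pinyin in the example above, and the two-word segmentation is assumed rather than taken from any particular corpus.

```python
def naive_hits(text, word):
    """Return every offset at which `word` occurs as a raw substring of `text`."""
    hits = []
    start = text.find(word)
    while start != -1:
        hits.append(start)
        start = text.find(word, start + 1)
    return hits

unsegmented = "英國文化"          # ying-guo wen-hua, written without word boundaries
print(naive_hits(unsegmented, "國文"))  # [1], a hit that straddles two words

# With an (assumed) segmentation into two words, the false hit disappears.
segmented_tokens = ["英國", "文化"]
print("國文" in segmented_tokens)       # False
```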

I might have said this before, but bear in mind that there is no absolute consensus on what constitutes a "word" in Chinese, perhaps because it's not really a linguistic category that native speakers work with. That means that when you're using a segmented corpus, your data is to some extent contaminated by the corpus builders' theoretical position on wordhood. That said, the same applies to all sorts of other corpus annotation generally.