[Corpora-List] Searching Japanese corpora

Wed Dec 20 23:35:09 UTC 2006

Greetings,

Following up on our recent thread about grep with Unicode, I'm curious
about how people search for text in Japanese-language corpora.

My understanding of Japanese is rudimentary, but is it not possible
(potentially at least) for the same text to be written in hiragana,
katakana, or kanji?  In order to find all occurrences of a particular
string in a corpus, would I have to do the search 3 times, once for
each script?  I assume that would be the case for something like grep.
But are there more sophisticated query tools which abstract away the
question of which script is actually used for data within the corpus?
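
For the two kana scripts at least, one possible approach is to fold
katakana onto hiragana before matching, so that a single search covers
both. The following is only a minimal sketch in Python: the names
fold_kana and find_kana_insensitive are hypothetical, and it assumes the
corpus is already decoded Unicode text held as a list of lines. It does
not handle kanji, which would presumably need a dictionary or
morphological analyzer that supplies readings.

```python
import unicodedata

# Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana
# counterparts (U+3041..U+3096), so a fixed offset maps one to the other.
_KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def fold_kana(text):
    """Fold katakana to hiragana (hypothetical helper).

    NFKC normalization first widens half-width katakana to full-width,
    then the translation table maps katakana to hiragana.
    """
    return unicodedata.normalize("NFKC", text).translate(_KATA_TO_HIRA)


def find_kana_insensitive(query, lines):
    """Return the lines containing query, ignoring which kana script is used."""
    folded_query = fold_kana(query)
    return [line for line in lines if folded_query in fold_kana(line)]


corpus = ["りんごを食べる", "リンゴを食べる", "林檎を食べる"]
# Matches the hiragana and katakana lines; the kanji line is not found.
print(find_kana_insensitive("りんご", corpus))
```

The long-vowel mark (ー) is left untouched by this folding, so spellings
that differ in how long vowels are written would still need separate
handling.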

Thanks,

Eric J. M. Smith
Dept. of Linguistics
University of Toronto