[Corpora-List] Searching Japanese corpora

Fri Dec 22 00:10:38 UTC 2006

Hi,

I'm sure there's a qualified Japanese speaker out there who can tell  
us this with authority (I'm not that person), but my understanding is  
that there is a canonical form for words.

Katakana is used exclusively for foreign words.
Kanji (+ Hiragana modifiers) is used for Japanese words.
Words are only spelt out in Hiragana in beginners' and learners'  
texts, normally in small type above the canonical Kanji form.
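
For the cases where a word does turn up in more than one script, the
many-queries approach Cyrus suggests in the message quoted below can be
folded into a single regular expression. What follows is only a rough
sketch in Python, not a reference to any existing corpus tool: the
function names are made up for illustration, the readings have to be
supplied by hand (or from a dictionary), and it relies on the fact that
the hiragana and katakana blocks in Unicode are laid out in parallel,
0x60 code points apart.

```python
import re

# Hiragana U+3041..U+3096 map onto katakana U+30A1..U+30F6 by adding 0x60.
KANA_OFFSET = 0x60


def hiragana_to_katakana(text):
    """Return text with every hiragana character replaced by its katakana twin."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def build_pattern(kanji, readings):
    """Compile one regex matching the kanji form or any kana spelling.

    `readings` is a list of hiragana pronunciations chosen by the user
    (an assumption: nothing here looks them up automatically).
    """
    variants = {kanji}
    for reading in readings:
        variants.add(reading)
        variants.add(hiragana_to_katakana(reading))
    # Put longer variants first so they are preferred over shorter ones.
    ordered = sorted(variants, key=lambda v: (-len(v), v))
    return re.compile("|".join(re.escape(v) for v in ordered))


def find_all(kanji, readings, corpus_text):
    """Return every occurrence of any variant, in order of appearance."""
    return build_pattern(kanji, readings).findall(corpus_text)
```

For example, find_all("林檎", ["りんご"], text) looks for 林檎, りんご and
リンゴ in one pass. Since this is plain substring matching with no word
segmentation, a short kana reading will also match inside longer,
unrelated words, so the hits still need checking by eye.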

This sort of problem exists in English too, but of course to a much  
lesser degree.  Do you search for "%" or "per cent"?  "3rd" or  
"third"?  "&" or "and"?

Cheers,
Brett

On 22/12/2006, at 4:11 AM, Cyrus Shaoul wrote:

> Hi Eric,
>
> It is my understanding that it is possible to write the pronunciation
> of all kanji and kanji compounds in both hiragana and katakana (and
> each kanji/kanji compound can have multiple pronunciations). In most
> types of written Japanese, it would be uncommon to write the
> pronunciation for kanji, and there are many words that are always
> written in katakana or hiragana, and never in kanji, so when searching
> for words, having a tool that would automatically search for a kanji
> word and its kana representations at the same time would not be that
> useful.
>
> I should confess that there are some words that are written in both
> kanji and kana with higher frequency, such as some older loanwords,
> some place names, some proper names, some low-frequency kanji, and a
> few other types of words. I have a gut feeling that the number of
> words that fall into these categories is not that large.
>
> I don't know of any tools out there to do the kind of query you
> mentioned, but it has been a few years since I worked on Japanese
> text. In the meantime, I can only suggest making many queries, one
> with the kanji/kanji compound and others with the hiragana and
> katakana spellings of all the possible pronunciations.
>
> Yours,
>
> Cyrus
>
> http://www.psych.ualberta.ca/~westburylab/
>
> Eric J. M. Smith wrote:
>> Greetings,
>>
>> Following up on our recent thread about grep with Unicode, I'm curious
>> about how people search for text in Japanese-language corpora.
>>
>> My understanding of Japanese is rudimentary, but is it not possible
>> (potentially at least) for the same text to be written in hiragana,
>> katakana, or kanji?  In order to find all occurrences of a particular
>> string in a corpus, would I have to do the search 3 times, once for
>> each script?  I assume that would be the case for something like grep.
>> But are there more sophisticated query tools which abstract away the
>> question of which script is actually used for data within the corpus?
>>
>> Thanks,
>>
>> Eric J. M. Smith
>> Dept. of Linguistics
>> University of Toronto
>>
>

--------------------------------------------------------------
Brett Powley -- PhD Candidate
Centre for Language Technology, Macquarie University,  Australia
w: http://www.ics.mq.edu.au/~bpowley
faciendi plures libros nullus est finis
frequensque meditatio carnis adflictio est
--------------------------------------------------------------