21.74, Qs: Japanese and English Corpora Research

linguist at LINGUISTLIST.ORG
Thu Jan 7 05:51:23 UTC 2010


LINGUIST List: Vol-21-74. Thu Jan 07 2010. ISSN: 1068 - 4875.

Subject: 21.74, Qs: Japanese and English Corpora Research

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Monica Macaulay, U of Wisconsin-Madison
         Eric Raimy, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Elyssa Winzeler <elyssa at linguistlist.org>
================================================================  

We'd like to remind readers that the responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST; so we
would appreciate your cooperating with it whenever it seems appropriate.

In addition to posting a summary, we'd like to remind people that it
is usually a good idea to personally thank those individuals who have
taken the trouble to respond to the query.

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 04-Jan-2010
From: Barry Kavanagh < b_kavanagh at auhw.ac.jp >
Subject: Japanese and English Corpora Research
 

	
-------------------------Message 1 ---------------------------------- 
Date: Thu, 07 Jan 2010 00:49:42
From: Barry Kavanagh [b_kavanagh at auhw.ac.jp]
Subject: Japanese and English Corpora Research

  


I have a question regarding corpora, if I may. At the moment I am looking at
non-verbal representations of language, such as emoticons, in
computer-mediated discourse, and I have compiled fairly large Japanese and
English corpora. Because I am counting these non-verbal or paralinguistic
cues within the corpora, the corpora need to be of the same size; otherwise
my data and findings may be deemed void. For example, if the Japanese corpus
is much bigger than the English one, these non-verbal representations are
more likely to appear in it simply because there is more text. I have tried
making the number of sentences the same in each corpus (very time-consuming,
and defining what a sentence is in online communication can be difficult).
I am also trying to find similar studies that have compared English and
Japanese corpora (no luck yet), and to see whether there are any reliable
equivalences stating, for example, that 400 kanji are equal to 1,000 English
words.
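
One possible alternative to forcing the corpora to the same size is to
report the counts as a rate, for example emoticons per 1,000 units of text.
The sketch below is only an illustration of that arithmetic in Python, not
an established method for this data. The pattern EMOTICON_RE and the
function rate_per_thousand are hypothetical names made up for the example,
and the choice of unit is itself an assumption: whitespace-separated words
are not meaningful for Japanese, which is written without spaces, so
non-space characters are used by default, and characters in the two
languages are not claimed to be equivalent.

```python
import re

# Hypothetical emoticon pattern, for illustration only. A real study would
# need carefully built inventories for both Western and Japanese styles.
EMOTICON_RE = re.compile(r"[:;]-?[)(DP]|\^_\^|\(\^o\^\)")


def rate_per_thousand(text: str, unit: str = "char") -> float:
    """Return the number of emoticon matches per 1,000 units of text.

    unit="char" counts non-whitespace characters (usable for both languages);
    unit="word" counts whitespace-separated tokens (English only).
    """
    hits = len(EMOTICON_RE.findall(text))
    if unit == "char":
        size = sum(1 for ch in text if not ch.isspace())
    else:
        size = len(text.split())
    if size == 0:
        return 0.0  # avoid division by zero on an empty corpus
    return 1000.0 * hits / size


if __name__ == "__main__":
    english = "hello :) world ;-) ok"
    japanese = "今日は楽しかった^_^"
    print(rate_per_thousand(english))
    print(rate_per_thousand(japanese))
```

With rates like these, corpora of different sizes can at least be placed on
a common scale, although the question of which unit makes Japanese and
English text comparable remains open.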

Any ideas or advice would be fantastic. 

Linguistic Field(s): Computational Linguistics

Subject Language(s): English (eng)
                     Japanese (jpn)




-----------------------------------------------------------
LINGUIST List: Vol-21-74	

	


