23.2009, FYI: New Linguistic Corpus of Sina Weibo Messages

linguist at linguistlist.org
Tue Apr 24 16:09:44 UTC 2012


LINGUIST List: Vol-23-2009. Tue Apr 24 2012. ISSN: 1069 - 4875.

Subject: 23.2009, FYI: New Linguistic Corpus of Sina Weibo Messages

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
         Monica Macaulay, U of Wisconsin-Madison
         Rajiv Rao, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org

The LINGUIST List is a non-profit organization dedicated to providing the
discipline of linguistics with the infrastructure necessary to function in
the digital world. Donate to keep our services freely available!
https://linguistlist.org/donation/donate/donate1.cfm

Editor for this issue: Kristen Dunkinson <kristen at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.


Date: Tue, 24 Apr 2012 12:08:55
From: Daan van Esch [daanvanesch at gmail.com]
Subject: New Linguistic Corpus of Sina Weibo Messages

It is my pleasure to announce the Leiden Weibo Corpus (LWC), an 
annotated 100-million-word linguistic corpus containing 5.1 million 
messages from Sina Weibo, China's premier Twitter-like microblogging 
service. 

The LWC is freely available online at http://lwc.daanvanesch.nl/. Data 
for the LWC was collected in January 2012. As a result, it contains many 
linguistic phenomena that may not be found in older corpora, such as 
suffixation with "-ing", an aspect marker borrowed from English. 

Furthermore, Sina Weibo messages come with valuable metadata, such as 
the gender of the user and their location. This information allows the 
LWC to calculate how often words are used in different provinces and 
cities across China, which is useful for research into regional lexical 
variation. 
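
To illustrate the kind of calculation involved, here is a minimal 
Python sketch of a per-province relative frequency count. It is not the 
LWC's actual code: the toy messages, the function name 
per_province_frequency, and the per-1,000-tokens normalisation are all 
assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy data: (province, segmented tokens) pairs.
# Real LWC messages and their metadata format may look different.
messages = [
    ("Beijing", ["我", "在", "吃饭", "ing"]),
    ("Beijing", ["今天", "很", "冷"]),
    ("Guangdong", ["我", "在", "上班", "ing"]),
    ("Guangdong", ["加班", "ing"]),
]

def per_province_frequency(messages, word):
    """Return occurrences of `word` per 1,000 tokens for each province."""
    hits = Counter()    # how often `word` occurs in each province
    totals = Counter()  # total number of tokens seen in each province
    for province, tokens in messages:
        totals[province] += len(tokens)
        hits[province] += tokens.count(word)
    # Normalise so provinces with more messages remain comparable.
    return {p: 1000 * hits[p] / totals[p] for p in totals if totals[p]}

freq = per_province_frequency(messages, "ing")
```

Normalising by the total token count per province, rather than 
comparing raw counts, keeps provinces with many users from dominating 
the comparison.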

Naturally, the LWC also supports searching for single words or 
grammar patterns, such as "any verb followed by an aspectual particle 
and then a noun". This feature may also be of interest to students and 
teachers of Mandarin who are looking for example sentences. 
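
A pattern search of this kind can be sketched in a few lines of Python 
over part-of-speech-tagged tokens. This is only an illustration under 
stated assumptions, not the LWC's implementation: the tag names "V", 
"ASP", "N" and "PRON", the function find_pattern, and the example 
sentence are hypothetical, and the LWC's actual tagset may differ.

```python
# Hypothetical tag names; the LWC's actual tagset may differ.
VERB, ASPECT, NOUN = "V", "ASP", "N"

def find_pattern(tagged, pattern):
    """Yield each run of adjacent tokens whose tags match `pattern` in order."""
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag == want for (_, tag), want in zip(window, pattern)):
            yield [word for word, _ in window]

# Toy sentence as (word, tag) pairs: "I ate (a meal)".
sentence = [("我", "PRON"), ("吃", "V"), ("了", "ASP"), ("饭", "N")]

# "Any verb followed by an aspectual particle and then a noun".
matches = list(find_pattern(sentence, [VERB, ASPECT, NOUN]))
```

Here matches holds the single three-token run 吃 了 饭; a sentence 
without such a sequence yields an empty list.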

Please feel free to forward this announcement to anyone who might be 
interested. Any feedback regarding the LWC would be greatly 
appreciated; please send it to daanvanesch at gmail.com.

Best wishes,

Daan van Esch
Graduate Student in Chinese linguistics
Leiden University 



Linguistic Field(s): Text/Corpus Linguistics