[Corpora-List] Call for SMS Contribution for a Public Research Corpus

Tao Chen taochen at comp.nus.edu.sg
Tue Apr 5 13:53:06 UTC 2011


Dear members of the corpora community:

We are seeking your help to enlarge a freely available public corpus
of SMS (Short Message Service) messages.  Over the last few months, at
the National University of Singapore (NUS), we have been collecting a
live corpus of SMS messages.  This builds on earlier work: in 2004, we
made a corpus of about 10,000 English messages, mostly from
Singaporeans, available to the public for study.

We restarted the 2004 project last October, aiming to enlarge the
corpus in both depth and breadth.  We are collecting better
demographic information, timestamps, and recipient and sender identity
(appropriately anonymized), and including these with the corpus'
messages.  So far, we have collected over 21,000 new English messages
and 10,000 Chinese messages.  Most messages are tagged with metadata
about the sender's profile (gender, age, country, years of using SMS,
number of SMS sent daily, etc.).  Because the collection process is
live and the corpus is growing, the corpus is versioned and a new
release is made every month; it is free for all communities to use.
For detailed information about our corpus, please visit our NUS SMS
Corpus site at: http://wing.comp.nus.edu.sg:8080/SMSCorpus.
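
For readers who want a concrete picture of one corpus entry, here is a
minimal sketch in Python.  It only mirrors the metadata listed above;
the class and field names are hypothetical and are not the corpus'
actual release format, which is documented on the project site.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SenderProfile:
    # Hypothetical field names mirroring the metadata listed in this message.
    gender: Optional[str] = None
    age: Optional[int] = None
    country: Optional[str] = None
    years_using_sms: Optional[int] = None
    sms_sent_daily: Optional[int] = None


@dataclass
class SmsRecord:
    text: str                        # message body, after anonymization
    language: str                    # e.g. "en" or "zh"
    sender_id: str                   # anonymized identifier, not a phone number
    timestamp: Optional[str] = None  # not every message carries one
    profile: Optional[SenderProfile] = None  # most, but not all, messages have a profile


record = SmsRecord(
    text="c u l8r",
    language="en",
    sender_id="anon-001",
    profile=SenderProfile(country="SG"),
)
print(asdict(record))
```

The profile and timestamp are optional in this sketch because only
most messages, not all, are tagged with metadata.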

We are writing to ask for your contribution, either directly or
indirectly, to help build this public resource.  SMS messages continue
to be a vital and sensitive vehicle for personal communication, one
that many of us use daily.  Until now, scholars have not had access to
a large, freely available SMS corpus to study.  Most research on SMS
has been done in collaboration with private companies that impose
strict non-disclosure agreements, which makes comparative SMS research
impossible.

As SMS messages are potentially sensitive and identity-revealing, our
collection framework attempts to anonymize sensitive data in messages,
such as telephone numbers, email addresses and other identifiers,
before accepting them into the corpus.  This is a legitimate attempt to
collect and enlarge an SMS corpus for the public good; if you are
concerned about the legitimacy of our project, please visit our webpage
first.  Additionally, this study has been exempted from NUS'
institutional review board (IRB) panel for human studies protocols.
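
To illustrate the kind of anonymization described above, here is a
rough sketch in Python.  It is not our collection framework's actual
code: the regular expressions and the <EMAIL> and <PHONE> placeholder
tokens are assumptions chosen for the example, and real messages would
need broader rules for other identifiers.

```python
import re

# Hypothetical patterns and placeholders, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least six digits/spaces/hyphens, then a digit;
# an optional leading "+" covers international prefixes.
PHONE_RE = re.compile(r"\+?\d[\d\s-]{6,}\d")


def anonymize(message: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    message = EMAIL_RE.sub("<EMAIL>", message)
    message = PHONE_RE.sub("<PHONE>", message)
    return message


print(anonymize("call me at 9123 4567 or mail bob@example.com"))
```

Short numbers such as a single digit or a time of day are left alone
in this sketch, since the phone pattern requires at least eight
characters in the digit run.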

Such a public corpus needs your contribution, as most of us are
senders of SMS. With a larger base of contributors and a growing
number of messages archived, the corpus will grow in depth and utility
to scholars everywhere.


Currently, there are three methods for you to contribute SMS messages
to the public corpus.  Please refer to the "Contribution" page on our
project site at http://wing.comp.nus.edu.sg:8080/SMSCorpus/ for
detailed information.  We summarize them below.

* Android phone owners - Please install our app "SMS Collection for
Corpus" (authored by Web IR / NLP Group @ NUS) from the Android
Market.  Follow the app's instructions to submit SMS to us.  The
software will create a draft message containing your SMSes to send to
us; you will have a chance to censor or delete any messages that you
do not want to contribute.

* Nokia phone owners - Please use Nokia PC Suite to export your SMS as
a CSV file.  The PC Suite software is available from our project page.
Then send the file to SMS.Donation at gmail.com.

* Owners of other phone brands - You can type your messages into the
contribution site's web page.  Alternatively, if you know of software
that can export your SMS as a file (e.g., a CSV file), export them and
then send the file to SMS.Donation at gmail.com.

(We currently do not have an automated donation method for the iPhone,
sorry!)


If you have any questions or suggestions, please feel free to contact
me.  We sincerely appreciate your suggestions and contributions!

-- 
Tao CHEN

PhD Candidate
Web IR / NLP Group (WING), School of Computing
National University of Singapore