[Corpora-List] SMS corpora?

Tao Chen taochen at comp.nus.edu.sg
Wed Sep 7 08:34:38 UTC 2011


Hi Yorick, all:

Greetings from NUS.  My name is Tao Chen, a second-year Ph.D. student
working on SMS corpus collection.  Our old 2004 NUS SMS Corpus (found
at http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/)
has already been mentioned by Yunqing in his reply (thanks to Yunqing
for the pointer!).

But I'd like to point out that our group (specifically, me) resurrected
the SMS collection project in October 2010, reviving it as a live
corpus project for gathering multilingual SMS, currently focusing on
English and Mandarin Chinese.  So far, we have collected 28,724
English SMS and 28,869 Chinese SMS, and we release a new version of
the corpus and its summary statistics every month.

http://wing.comp.nus.edu.sg/SMSCorpus/

Importantly, this corpus is in the public domain and freely available
for any use, including commercial and research purposes.  The latest
version of the corpus is downloadable from our school's Research to
Market portal (which requires registration just for record-keeping
purposes).  Past versions (one or more months old) are freely
available as simple download links on the corpus webpage.

The corpus was collected under the NUS IRB exemption policy (#10-481),
and important identifiers in the corpus have been replaced by
placeholder tokens for de-identification purposes.
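
As a rough illustration of the placeholder replacement described
above, a minimal Python sketch might look like the following.  The
patterns, the token names (<EMAIL>, <URL>, <NUMBER>), and the function
name deidentify are assumptions made for this example; they are not
the rules actually used for the corpus.

```python
import re

# Hypothetical patterns and placeholder tokens, for illustration only;
# these are NOT the rules used to build the NUS SMS Corpus.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),  # e-mail addresses
    (re.compile(r"https?://\S+"), "<URL>"),                    # web links
    (re.compile(r"\d{5,}"), "<NUMBER>"),                       # long digit runs, e.g. phone numbers
]


def deidentify(message: str) -> str:
    """Replace each matched identifier with its placeholder token."""
    for pattern, token in PATTERNS:
        message = pattern.sub(token, message)
    return message


print(deidentify("Call me at 91234567 or mail bob@example.com"))
```

E-mail addresses are replaced before digit runs so that an address
containing a long number is replaced as a whole rather than in pieces.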

You may also be interested in our draft article, currently in
preparation, about the corpus creation; it also contains a
comprehensive literature review of existing SMS corpora.  If you (or
others) are interested in the details, we are most happy to share the
draft with you.

We'd like to encourage you, and anyone else interested, to contribute
to the corpus.  We have experimented with a number of collection
methods in our study and are documenting them in the draft article.
Finally, if you have any suggestions for improving the SMS collection
process, or for changing it to better serve your research, please do
get in touch with us.  We really want to know how to make this corpus
more useful to SMS studies of all kinds.

Sincerely,

Tao CHEN
on behalf of the Web IR / NLP Group (WING) at NUS
http://www.comp.nus.edu.sg/~taochen/