[Corpora-List] SMS corpora?

Trevor Jenkins trevor.jenkins at suneidesis.com
Sun Sep 4 10:09:54 UTC 2011


On 3 Sep 2011, at 23:05, Yorick Wilks wrote:

> Is anyone aware of an easily obtained corpus of (semi!!)English SMS  
> messages?
> I'd be grateful for pointers.

A quick and naive search with Google found me the Singapore corpus
that has already been mentioned in response to your question.
Several others seem to be available too. The last time someone asked
here for an SMS corpus I suggested the pager messages of 9/11
available from WikiLeaks. They are one-sided "conversations" rather
than bilateral discourse.

You might also have success on the text mining page(s) of the
knowledge discovery nuggets web site at www.kdnuggets.com. They have a
dataset of almost 6,000 SMS spam messages available.

The Guardian has constructed a "database" of Tweets sent during
the recent riots in Britain. If that becomes available it might also
serve your purpose(s). Twitter provides hooks for extracting text in
volume, so constructing such corpora is becoming easier.

I wonder how convergent technology now affects SMS messages. Are
Tweets/tumbles and other restricted-length social networking comments
any different from SMS messages these days? As someone who uses
neither method (apart from a few SMS messages to Deaf colleagues), my
observation is that of an outsider.

And even when you've found/established your corpus of messages, how
will you deal with the lexical issues of l33t and txt spk? Does the
use of l33t and txt spk affect your definition of semi-English?

Regards, Trevor.

<>< Re: deemed!
