[Corpora-List] SMS corpora?
Trevor Jenkins
trevor.jenkins at suneidesis.com
Sun Sep 4 10:09:54 UTC 2011
On 3 Sep 2011, at 23:05, Yorick Wilks wrote:
> Is anyone aware of an easily obtained corpus of (semi!!)English SMS
> messages?
> I'd be grateful for pointers.
A quick and naive search with Google found me the Singapore corpus
that has already been mentioned in response to your question.
Several others seem to be available too. The last time someone asked
here for an SMS corpus I suggested the pager messages of 9/11
available from WikiLeaks. They are one-sided "conversations" rather
than bi-lateral discourse.
You might also have success on the text mining page(s) of the
knowledge discovery nuggets web site at www.kdnuggets.com. They have a
dataset of almost 6,000 SMS spam messages available.
The Guardian recently constructed a "database" of Tweets sent during
the riots in Britain. If that becomes available it might also
serve your purpose(s). There are hooks in Twitter to extract text in
volume, so constructing such corpora becomes easier.
I wonder how convergent technology now affects SMS messages. Are
Tweets/tumbles and other social networking restricted-length comments
any different from SMS messages these days? As someone who uses
neither method (apart from a few SMS messages to Deaf colleagues), my
observation is that of an outsider.
And even when you've found/established your corpus of messages, how
will you deal with the lexical issues of l33t and txt spk? Does the
use of l33t and txt spk affect your definition of semi-English?
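
To make the lexical question concrete, here is a minimal sketch of one
possible approach: a lookup table for txt spk plus a digit-to-letter
mapping for l33t. The tables and the names TXT_SPK, LEET_DIGITS,
normalise_token and normalise are all invented for illustration; they
are assumptions, not part of any existing corpus or tool, and a real
study would probably derive its mappings from the data itself.

```python
import re

# Hypothetical lookup tables, invented for this illustration only.
LEET_DIGITS = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"}
)
TXT_SPK = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "l8r": "later",
    "txt": "text",
    "spk": "speak",
    "msg": "message",
}

# Tokens are runs of letters, digits and apostrophes.
TOKEN = re.compile(r"[A-Za-z0-9']+")


def normalise_token(token):
    """Map one token to a guessed standard spelling."""
    lowered = token.lower()
    # Known abbreviations take priority over the l33t digit mapping.
    if lowered in TXT_SPK:
        return TXT_SPK[lowered]
    # Treat digits as l33t only when mixed with letters, so that plain
    # numbers such as "2011" are left alone.
    has_alpha = any(c.isalpha() for c in lowered)
    has_digit = any(c.isdigit() for c in lowered)
    if has_alpha and has_digit:
        return lowered.translate(LEET_DIGITS)
    return lowered


def normalise(message):
    """Tokenise a message and normalise each token."""
    return [normalise_token(t) for t in TOKEN.findall(message)]


if __name__ == "__main__":
    print(normalise("c u l8r, l33t txt spk"))
```

A scheme this crude mishandles rebus forms that are not in the table
("4ever" would come out as "aever"), which is one reason the choice of
normalisation feeds back into what counts as semi-English.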
Regards, Trevor.
<>< Re: deemed!