[Corpora-List] text message corpus - clarification

Christopherson, Laura llchrist at email.unc.edu
Mon Apr 11 18:41:36 UTC 2011


Hi All,

I am responding to the requests for clarification of my earlier post asking
for an English-only text messaging corpus. Thanks so much for reading and
responding. I definitely need to be more specific!

A couple of points were raised about the notions of "text messages" and
"personal." I will try to clarify these points.

When I used the term "text messages," I meant it in a specific way (not a
general usage of "things/documents/files in text"). Specifically, I meant
SMS (Short Message Service), as Benjamin indicated - messages created on
cellphones and sent through a service provider's (e.g. AT&T's) messaging
service.

Regarding the "personal" idea, absolutely yes - ultimately each message is
personal to someone. What I'm looking for is a collection whose messages
are not personal **to the collector** - i.e. not the collector's own
messages to/from family/friends, and not messages created only by the
collector's family/friends. For instance, Caroline
Tagg has an awesome corpus of SMS messages; but with the exception of a
small subset of that corpus, all messages are from people she knows
personally (family/friends). By contrast, there is the NUS SMS corpus
that John recommended. As I understand it, this corpus was created by
having students (not necessarily personal friends/family of the
collector) submit messages to the collector, so I consider it
"non-personal." (Does this make sense?)

While the NUS SMS corpus satisfies the "non-personal" requirement, it
doesn't satisfy the English-only requirement. I had originally intended to
use it, but once I got into it I realized I could not because there is
so much code-switching, even within a single message. I don't speak any of
the other languages used in Singapore and would be at a loss to make solid
distinctions between Netspeak (see David Crystal: Language and the
Internet) terms in English, Netspeak terms in some other language, and
non-Netspeak terms in a non-English language. Sigh - because it too is a
wonderful corpus. 

Susan's corpus and Trevor's suggestion of WikiLeaks may be right on target
for me, if what I've hopefully clarified here matches your (Susan's and
Trevor's) understanding of these text messages.

I really appreciate your help with this!

Laura