[Corpora-List] text message corpus - clarification

Trevor Jenkins trevor.jenkins at suneidesis.com
Mon Apr 11 23:22:09 UTC 2011


On Mon, 11 Apr 2011, Christopherson, Laura <llchrist at email.unc.edu> wrote:

> Regarding the "personal" idea, absolutely yes - ultimately each message is
> personal to someone. I'm more interested in text messages that are not a
> collection of messages which are personal **to the collector** - i.e. not
> the collector's own messages to/from his family/friends or messages that
> are created by only the collector's family/friends. For instance, Caroline
> Tagg has an awesome corpus of SMS messages; but with the exception of a
> small subset of that corpus, all messages are from people she knows
> personally (family/friends). ...

But if you had enough of these individual collections to work with, the
implicit bias would disappear, presuming little overlap between the senders
and recipients. Defining ``enough'' will be hard. Would 20, 200, 2,000, or
20,000 different collections be sufficient?

And then there's the demographics of the people. Amongst friends who text
me I see a variety of styles based solely on demographics. Older senders
are more likely to write messages out in full, younger ones to use l33t or txt spk
abbreviations. Messages to my phone also have very different content
depending upon whether the originator is Deaf or not. (I work as a
community sign language interpreter.) The Deaf senders tend toward brevity
but without using l33t/txt spk conventions; the hearing senders will play
with homophonic abbreviations like HOW R U? and C U L8R.

> Susan's corpus and Trevor's suggestion of wikileaks may be right on target
> for me if what I've hopefully clarified gels with your (Susan and
> Trevor's) understanding of these text messages.

The WikiLeaks collection is variously described as SMS or pager messages.
Both are short messaging systems, but a major difference would be that SMS
could be bi-directional whereas pager messages are uni-directional. Reconstructing SMS
dialogues might be difficult unless you had a stringent collection
protocol.

You may have to search for the WikiLeaks material as a) their website gets
overloaded, b) it gets hit with DDoS attacks by disaffected and disgruntled
objectors to their activity, and c) the pager material is buried several
links deep.

Regards, Trevor

<>< Re: deemed!

