[Corpora-List] Wikileaks Pager Corpus

Trevor Jenkins trevor.jenkins at suneidesis.com
Tue Apr 12 09:41:35 UTC 2011


Having mentioned, in a response to Laura's request for text messages, the
Wikileaks 9/11 pager corpus and how difficult it can be to locate, I went
in search of it.

This page http://mirror.wikileaks.info/wiki/911/ will get you to the data.

The data is segmented into separate files of 5-minute time slices(*).
Coverage is said to be continuous from 3AM on 11th September to 3AM on the
12th. I have not checked the sequence myself. There are approximately half
a million texts. The exact number is disputed.

Each message is a single line of text. The format takes some getting used
to but basically it is date (in ISO format), time, service operator, pager
number, code(s) that identify the message content/encoding, followed by
the message itself. The codes vary a little between the operators but are
not difficult to unravel.
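
As a rough illustration of the layout just described, a line could be
pulled apart in Python along the following lines. This is only a sketch:
the sample line, the operator name "ExampleOp", the field names and the
assumption that fields are separated by whitespace are mine and have not
been checked against the data. The code(s) are left attached to the
message because they vary between operators.

```python
from typing import NamedTuple, Optional


class PagerLine(NamedTuple):
    date: str      # ISO date, e.g. 2001-09-11
    time: str      # clock time
    operator: str  # service operator name
    pager: str     # pager number, kept as text
    rest: str      # code(s) plus message text, left unsplit


def parse_line(line: str) -> Optional[PagerLine]:
    """Split one line into its leading fields; None if too few fields."""
    parts = line.rstrip("\n").split(None, 4)
    if len(parts) < 4:
        return None          # incomplete line, nothing usable
    if len(parts) == 4:
        parts.append("")     # header fields present but empty message
    return PagerLine(*parts)


# Made-up sample line (assumption), not taken from the corpus.
sample = "2001-09-11 03:00:00 ExampleOp [0000000] A ALPHA test message"
record = parse_line(sample)
print(record)
```

Splitting only on the first four gaps means any odd spacing or
punctuation inside the message itself is left untouched.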

There is a reddit page linked to the above. Several commentators detail
the format. A few ``conversations'' are highlighted there too. Selecting
the actual messages by pager number will show these clearly. There's also
the usual speculation and conspiracy drizzle.
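
Selecting by pager number could be sketched in Python as below. The
sample lines and the operator name are invented for illustration, and
the code assumes whitespace-separated fields with the pager number in
the fourth position, which I have not verified against the files.

```python
from collections import defaultdict


def group_by_pager(lines):
    """Collect (time, text) pairs under each pager number, in file order."""
    threads = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split(None, 4)
        if len(parts) == 5:          # skip lines with missing fields
            threads[parts[3]].append((parts[1], parts[4]))
    return threads


# Invented sample lines (assumption), not taken from the corpus.
sample_lines = [
    "2001-09-11 09:00:01 ExampleOp [0000001] A ALPHA are you ok\n",
    "2001-09-11 09:00:05 ExampleOp [0000002] A ALPHA system offline\n",
    "2001-09-11 09:02:40 ExampleOp [0000001] A ALPHA please call home\n",
]
threads = group_by_pager(sample_lines)
for when, text in threads["[0000001]"]:
    print(when, text)
```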

The actual message content varies. Some are automated status messages
about trading systems that have gone offline. Some are news reports of
other suspicious activities. Some are not in English; I spotted several in
Spanish. Some are encoded; a few appear to be (weakly) encrypted; a lot
more are quasi-MIME-encoded binary data. Some are in plain text that
should be encrypted; there are messages to pagers, which have been traced
back to FBI, US Secret Service and similar agencies, receiving national
security intelligence. Some are personal. Some are from lovers conducting
affairs. Some are just plain weird.

Individual messages can appear incomplete, and if you're processing the
data be careful of unmatched quotes. A couple of reddit contributors
provided awk scripts to process the messages into CSV format. Another
created an SQLite dump. I have not checked that the awk scripts work
properly, nor have I checked that the SQLite dump still exists or has
said content.
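
For anyone who would rather not rely on the unchecked awk scripts, a
CSV conversion could be sketched in Python as follows. This is not a
port of those scripts; the sample line, the column names and the
whitespace-separated layout are my assumptions. The point of the sketch
is that the standard csv module doubles any quote character inside a
field, so an unmatched quote in a message does not corrupt the output.

```python
import csv
import io


def to_csv(lines, out):
    """Write pager lines to `out` as CSV, quoting every field."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["date", "time", "operator", "pager", "rest"])
    for line in lines:
        parts = line.rstrip("\n").split(None, 4)
        if len(parts) < 5:
            continue  # incomplete line; skipped in this sketch
        writer.writerow(parts)


# Invented sample (assumption) containing an unmatched double quote.
sample = ['2001-09-11 03:00:00 ExampleOp [0000000] A ALPHA he said "call me']
buf = io.StringIO()
to_csv(sample, buf)

# Read the CSV back to confirm the stray quote survives the round trip.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows[1][4])
```

Whether incomplete lines should be skipped, as here, or kept and flagged
depends on what you want to do with the corpus.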

Another of the reddit commentators links to their own blog describing how,
with a radio scanner and some simple hardware, it is possible to scrape the
airwaves for current pager messages. The ethics of doing so are suspect.
And in the UK, where I am, such activity would be in breach of the
Telecommunications Act 1949(?) which prohibits the interception of
communications. There are similar pages in the blogosphere that document
how to intercept SMS messages in a similar fashion. But with the on-going
News International debacle over cracking of voicemail messages by
journalists at the News of the World, cracking mobile phone transmissions
for SMS content is probably not a good idea.

(*) The original release was done in real-time with each batch of messages
made available at the same clock time as on the day.

As a possible aside, I see that the Singapore SMS corpus, which was also
mentioned in reply to Laura's enquiry, includes meta-data on the model of
phone being used. Comparing the style of some of those messages with the
Wikileaks pager messages I wondered if for the SMS one there was any
apparent stylistic difference in the content based solely upon the device
being used. There is no such meta-data for the pager messages other than
the service provider name.

Regards, Trevor

<>< Re: deemed!


