[Corpora-List] Building a corpus from Twitter & Tw's privacy concerns

Miles Osborne miles at inf.ed.ac.uk
Thu Jul 18 07:55:00 UTC 2013


Basically Twitter's insistence on distributing IDs and not raw Tweets stems
from the fact that third parties need to honour deletion requests.

If you pass around raw Tweets then Twitter has no way to argue that a
deleted Tweet is actually deleted, since copies persist outside its
control. If instead you force people to recrawl them by ID each time, then
Tweets can be deleted at source and no subsequent access request will
return the deleted Tweet.
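
As a rough illustration of the recrawl-by-ID scheme described above, here
is a minimal Python sketch. It does not call the real Twitter API: the
`rehydrate` function, the `fetch` callback and the in-memory `source`
dictionary are all hypothetical stand-ins, chosen only to show why a Tweet
deleted at source drops out of every later rebuild of the corpus.

```python
from typing import Callable, Dict, Iterable, Optional


def rehydrate(
    tweet_ids: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Rebuild a corpus from shared IDs.

    `fetch` is a hypothetical lookup that returns the Tweet text,
    or None when the Tweet no longer exists at source.
    """
    corpus = {}
    for tweet_id in tweet_ids:
        text = fetch(tweet_id)
        if text is not None:  # deleted Tweets are silently skipped
            corpus[tweet_id] = text
    return corpus


# Toy in-memory store standing in for Twitter (an assumption, not real data).
source = {"1": "first tweet", "2": "second tweet", "3": "third tweet"}

# Only the IDs are distributed to researchers.
shared_ids = ["1", "2", "3"]

before = rehydrate(shared_ids, source.get)

# The author deletes Tweet "2" at source.
del source["2"]

after = rehydrate(shared_ids, source.get)
```

Here `before` holds all three Tweets while `after` holds only "1" and "3",
so two researchers working from the same ID list at different times can
end up with different corpora.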

Personally I think this way of distributing Tweets in bulk is not scalable
and acts as a barrier to research. Additionally, one could argue that
denying people access to static Tweet corpora undermines reproducible
research.

Miles

-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.