[Corpora-List] Building a corpus from Twitter & Tw's privacy concerns

Sasa Petrovic montyphyton at gmail.com
Wed Jul 17 11:51:54 UTC 2013


Making things anonymous doesn't help; you still can't distribute the 
tweets.  As for scraping without the use of their API, I'm not sure what 
the legal situation is, but I am sure that Twitter will ask you to take 
down the data if you start sharing it.  Whether you want to get into a 
legal battle with them is another matter.
The core of the problem is that Twitter promises its users that they can 
delete their tweets, and when they do, the tweet should also be deleted by 
anyone who has collected it.  This is not possible if someone is 
distributing the raw tweets.
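
A minimal sketch of the workflow discussed in this thread: share only tweet
IDs, let recipients re-fetch the text themselves, and drop tweets that have
since been deleted. Everything here is an assumption for illustration. The
corpus layout (a dict from tweet ID to text) and the names export_ids,
prune_deleted and rebuild are hypothetical, and the fetch argument stands in
for whatever lookup Twitter's API provides; no real API call is made.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set

# Hypothetical in-memory layout: tweet ID (as a string) -> tweet text.
Corpus = Dict[str, str]


def export_ids(corpus: Corpus) -> List[str]:
    """Return only the tweet IDs, in numeric order, for sharing.

    The tweet text itself is never included in the exported list.
    """
    return sorted(corpus, key=int)


def prune_deleted(corpus: Corpus, still_available: Set[str]) -> Corpus:
    """Keep only the tweets whose IDs are still available upstream.

    `still_available` is assumed to come from a separate availability
    check; how that check is done is outside this sketch.
    """
    return {tid: text for tid, text in corpus.items() if tid in still_available}


def rebuild(ids: Iterable[str], fetch: Callable[[str], Optional[str]]) -> Corpus:
    """Rebuild a corpus from shared IDs using a caller-supplied fetcher.

    `fetch` is a placeholder: it should return the tweet text, or None
    when the tweet has been deleted or is otherwise unavailable.
    """
    rebuilt: Corpus = {}
    for tid in ids:
        text = fetch(tid)
        if text is not None:
            rebuilt[tid] = text
    return rebuilt
```

Because rebuild skips every ID whose fetch returns None, a recipient working
from the shared ID list at a later date may end up with only a subset of the
original collection.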

cheers,
Sasa

On Tuesday, July 16, 2013 6:27:20 PM UTC+2, Djamé Seddah wrote:
>
> Hi, 
> I was given to understand that if 
> a) the tweets were collected without an account and without using an API (raw 
> cut'n'paste), you were not legally bound by the terms of use, and 
> b) they were made totally anonymous (which is the hard and tedious 
> part), you could redistribute some of them as a sample collection, thus 
> falling under fair use. 
>
> We've built a small social media Treebank for French -- used as a 
> statistical parsing stress test -- and we never managed to get an answer 
> (by phone or by mail) from Twitter (or from Facebook), while we did get 
> agreement from other sources. 
>
> Best, 
> Djamé 
>
>
>
>
> On 16 Jul 2013, at 17:44, John D. Burger wrote: 
>
> > There appears to be no legal reason you can't collect a corpus of 
> > tweets.  However, per Twitter's Terms of Use you cannot redistribute the 
> > tweets to others.  A common practice is to instead distribute the tweet 
> > IDs, which other people can use to fetch the tweets using Twitter's API. 
> > This is how NIST "distributes" the data in their Tweets2011 corpus: 
> > 
> > http://trec.nist.gov/data/tweets/ 
> > 
> > This is less than optimal for research, though, since in the interim 
> > some of the Twitter users may have deleted tweets in the collection. For a 
> > sufficiently large corpus, this means that anybody else attempting to use 
> > the same data at a later date will almost certainly end up with a subset of 
> > your corpus. As far as I know, however, this is currently the only legal 
> > method for sharing tweets. 
> > 
> > - John Burger 
> >  MITRE 
> > 
> > On Jul 16, 2013, at 10:51 , M.E.Sciubba wrote: 
> > 
> >> Dear ListMembers, 
> >> 
> >> I'd like to create a corpus of Italian tweets, but searching online I 
> >> found out that it is no longer possible because Twitter has changed its 
> >> privacy settings. 
> >> 
> >> Have any of you tried to build a Twitter corpus, and if so, how? 
> >> 
> >> Any suggestions will be much appreciated (bearing in mind that I am not a 
> >> programmer). 
> >> 
> >> Best, 
> >> 
> >> Eleonora 
> >> 
> >> 
> >> 
> >> Dr. Maria Eleonora Sciubba 
> >> Associate Researcher 
> >> Archivio di LInguA Spontanea 
> >> tel. +32 16 3 24795 
> >> cell +32 483 616 114 
> >> 
> >> KU Leuven – Faculty of Arts 
> >> 
> >> Department of French, Italian and Comparative Linguistics 
> >> 
> >> Blijde-Inkomststraat 21, PO BOX 3308 
> >> 
> >> B - 3000 Leuven 
> >> 
> >> http://www.kuleuven.be/wieiswie/nl/person/00088846 
> >> 
> >> 
> >> 
> >> 
> >> Be green. Keep it on the screen 
> >> _______________________________________________ 
> >> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora 
> >> Corpora mailing list 
> >> Cor... at uib.no 
> >> http://mailman.uib.no/listinfo/corpora 