[Corpora-List] Building a corpus from Twitter & Tw's privacy concerns

Miles Osborne miles at inf.ed.ac.uk
Thu Jul 18 08:43:43 UTC 2013


This is a bit of a digression, but it also underlines why building a
start-up (which is similar to doing academic Social Media research) on
Twitter data is a very risky business.  As a community we should try to
identify other Social Media streams so that we are less dependent upon one
company.

Adam: Privacy is key, I agree, and it is something that I am working on now.
Mechanisms for distributing data, whilst making guarantees about which
information can be inferred from it, should be the next step.  Whether
society as a whole allows research using this data is a different
question, however, and out of my control.

Miles


On 18 July 2013 09:33, Miguel Almeida <miguelbalmeida at gmail.com> wrote:

> Adam, Miles,
>
> I think another reason is so that Twitter can "black out" everyone else at
> any time in the future. It's a great (and very selfish and narrow-minded)
> idea: let the research community publish papers with your data, showing you
> how to find interesting stuff in your data (using taxpayer money!), and
> then if at some point you want to black them out, use the kill switch.
>
> I don't think Twitter's owners care that much about reproducible research.
> ;)
>
> Miguel
>
>
> On Thu, Jul 18, 2013 at 9:26 AM, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
>> Miles,
>>
>> > acts as a barrier to research.  Additionally one could argue that
>> > preventing people from having access to static Tweet corpora
>> > undermines doing reproducible research.
>>
>> You can argue all you like but it's a bit irrelevant - the data privacy
>> battleground is the whole wide world, with hi-tech companies, politicians
>> and the media playing for big prizes, and they really won't care one jot
>> what we worker ants think (or whether they trample us).
>>
>> adam
>>
>> On 18 July 2013 08:55, Miles Osborne <miles at inf.ed.ac.uk> wrote:
>>
>>> Basically Twitter's insistence on distributing IDs and not raw Tweets
>>> stems from the fact that third parties need to honour deletion requests.
>>>
>>> If you pass around raw Tweets then there is no way for Twitter to argue
>>> that a deleted Tweet is deleted. If instead you force people to recrawl
>>> them each time then Tweets can be deleted at source and all subsequent
>>> access requests will not return that deleted Tweet.
>>>
>>> Personally I think this way of distributing Tweets in bulk is not
>>> scalable and acts as a barrier to research.  Additionally one could argue
>>> that preventing people from having access to static Tweet corpora
>>> undermines doing reproducible research.
>>>
>>> Miles
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>>
>>
>>
>> --
>> ========================================
>> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
>> adam at lexmasterclass.com
>> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
>> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
>>
>> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
>> *DANTE: a lexical database for English* <http://www.webdante.com>
>> ========================================
>>
>>
>>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.