[Corpora-List] Twitter datasets available.

Miles Osborne miles at inf.ed.ac.uk
Mon Jun 28 18:17:36 UTC 2010

We have made the following Twitter-related material available for download:

--97 Million Tweets
--meta-information about the users who posted the Tweets
--Tweets annotated as corresponding to a news event, spam, or otherwise.


The first two sets of data are anonymised; more details about their
construction can be found here:

Sasa Petrovic, Miles Osborne and Victor Lavrenko. The Edinburgh
Twitter Corpus. Computational Linguistics in a World of Social Media
(workshop at NAACL), Los Angeles, USA. June 2010.

The events dataset is from a later period and was used in our NAACL 2010 paper:

Sasa Petrovic, Miles Osborne and Victor Lavrenko. Streaming First
Story Detection with application to Twitter. NAACL, Los Angeles, USA.
June 2010.


