[Corpora-List] Twitter datasets available.

Miles Osborne miles at inf.ed.ac.uk
Mon Jun 28 18:17:36 UTC 2010


We have made the following Twitter-related material available for download:

--97 Million Tweets
--meta-information about the users who posted the Tweets
--Tweets annotated as corresponding to a news event, spam, or otherwise.

http://demeter.inf.ed.ac.uk/

The first two sets of data are anonymised; more details about their
construction can be found here:

Sasa Petrovic, Miles Osborne and Victor Lavrenko. The Edinburgh
Twitter Corpus. Computational Linguistics in a World of Social Media
(workshop at NAACL), Los Angeles, USA. June 2010.
http://www.iccs.inf.ed.ac.uk/~osborne/papers/socmed10.pdf

The events dataset is from a later period and was used in our NAACL 2010 paper:

Sasa Petrovic, Miles Osborne and Victor Lavrenko. Streaming First
Story Detection with application to Twitter. NAACL, Los Angeles, USA.
June 2010.
http://www.iccs.inf.ed.ac.uk/~osborne/papers/naacl10a.pdf

Miles
Sasa
Victor

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


