[Corpora-List] Clean corpus including user relationships (Enron? Twitter?)

Vincent mailinglists at vinnl.nl
Sun Apr 6 13:42:20 UTC 2014


Hi all,

For my master's thesis I want to compare relationship-based community
detection methods with text-based methods. Hence, I need a corpus that
includes both relationship data and text.

Currently, I'm thinking of the Enron email dataset. It includes
relationships (who mailed whom?) and text (the actual emails). It has a few
issues though:

- Users can have multiple email addresses.
- Not all text is written by the email's sender, or even by a human (think
quotes, signatures, spam and whatnot).

Does anyone have access to a cleaned-up dataset that includes both, or
perhaps a script that cleans up email text so that only content written by
the sender remains? Alternatively, is there a different clean dataset that
includes both relationships between people and text produced by those
people? Twitter comes to mind.
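
To make the kind of cleanup concrete, here is a rough Python sketch. It is
only an illustration under stated assumptions, not an existing tool: the
alias table, the addresses in it, the function names and the quote/signature
heuristics are all hypothetical, and real Enron mail would need many more
cases.

```python
import re

# Hypothetical alias table: maps every known address of a person to one
# canonical ID. In practice this would have to be built or curated by hand.
ALIASES = {
    "j.doe@example.com": "jdoe",
    "john.doe@example.com": "jdoe",
}

# Assumed heuristic: a reply header such as "On ... wrote:" or
# "-----Original Message-----" marks the start of quoted material.
QUOTE_HEADER = re.compile(
    r"^(On .+ wrote:|-+ ?Original Message ?-+)$", re.IGNORECASE
)


def canonical_sender(address: str) -> str:
    """Return one ID per person; unknown addresses map to themselves."""
    key = address.strip().lower()
    return ALIASES.get(key, key)


def strip_quotes_and_signature(body: str) -> str:
    """Keep only lines that were plausibly typed by the sender."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        # "--" alone on a line is the conventional signature delimiter.
        if stripped == "--":
            break
        # Everything after a reply header is assumed to be quoted.
        if QUOTE_HEADER.match(stripped):
            break
        # Lines starting with ">" are quoted from an earlier message.
        if stripped.startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Applied to each message, canonical_sender would give the nodes of the
relationship graph and strip_quotes_and_signature the text attributed to
each node. Spam and auto-generated mail are not handled here.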

Thanks in advance,
-- 
Vincent