<div dir="ltr"><div><div><div><div>Hi all,<br><br></div>For my master's thesis I want to compare relationship-based community detection methods with text-based ones. Hence, I need a corpus that includes both relationship data and text.<br><br>


Currently, I'm thinking of the Enron email dataset. It includes relationships (who mailed whom?) and text (the actual emails). It has a few issues, though:<br><br></div>- Users can have multiple email addresses.<br></div>


- Not all text is written by the email's sender, or even by a human (think quoted replies, signatures, spam and whatnot).<br><br></div>Does anyone have access to a cleaned-up version of this dataset that includes both relationships and text, or perhaps a script that strips email text down to the content actually written by the sender? Alternatively, does anyone know of a different clean dataset that includes both relationships between people and text produced by those people? Twitter comes to mind.<br clear="all">


<div><div><div><div><div><br></div><div>Thanks in advance,<br></div><div>-- <br>Vincent


</div></div></div></div></div></div>