A few years ago, there was some controversy over whether off-the-shelf compression techniques such as gzip could be used for authorship attribution.  If you are interested in trying something like that, you might want to have a look at:<br>

<br><div style="margin-left:40px">Yuval Marton, Ning Wu, and Lisa Hellerstein. "On Compression-Based Text Classification". Proceedings of the 27th European Conference on Information Retrieval (ECIR), Spain, March 2005. <a href="http://www1.ccls.columbia.edu/%7Eymarton/pub/ecir05/finalforwebabs.pdf">Abstract</a>. Full paper <a href="http://www.springerlink.com/index/M24DHW0XDMREHE64">here</a> or <a href="http://www1.ccls.columbia.edu/%7Eymarton/pub/ecir05/final.pdf">here</a>. <a href="http://www1.ccls.columbia.edu/%7Eymarton/pub/ecir05/onCompressionForClassificationErrata4.htm">Errata note</a>.</div><br>... and the papers cited within. <br><br>Pro: very easy to use. Simply split each chapter into "training" and "test" parts and follow the recipe in the paper.<br>Con: not the most standard way of carrying out authorship attribution tasks.<br>
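<br>To give a flavor of the general idea, here is a minimal sketch in Python. It is not the exact recipe from the paper: it assumes zlib (the DEFLATE compressor that gzip also uses) and a simple "extra compressed bytes" score, and the function names are made up for illustration.<br>

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(train: str, test: str) -> int:
    """Additional compressed bytes needed when `test` is appended to `train`.

    Assumption: a smaller value suggests `test` resembles `train` more closely.
    """
    return compressed_size(train + test) - compressed_size(train)


def attribute(test: str, training_by_author: Mapping[str, str]) -> str:
    """Return the (hypothetical) author label whose training text encodes `test` most cheaply."""
    return min(
        training_by_author,
        key=lambda author: extra_bytes(training_by_author[author], test),
    )
```

Usage would look like <code>attribute(test_part, {"A": train_a, "B": train_b})</code>. Keep in mind that DEFLATE only looks back over a 32 KB window, so with long training texts only the tail of the training part influences the score.<br>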

<br><br>Yuval Marton<br>IBM Research<br><br><div class="gmail_quote">On Tue, Apr 17, 2012 at 3:47 PM, Mark Davies <span dir="ltr">&lt;<a href="mailto:Mark_Davies@byu.edu">Mark_Davies@byu.edu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.<br>


<br>

Mark Davies, BYU<br>

<br>

-------------------------------------------<br>

<br>

<br>

I am working with a 250,000-word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared by chapters A and B that appear nowhere else in the book. Intuitively it seems like there is a unique relationship between chapters A and B.<br>


<br>

The question is:<br>

<br>

Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked?<br>

_______________________________________________<br>


Corpora mailing list<br>

<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

</blockquote></div><br>
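<br>Regarding the n-gram counts in the quoted question: the sketch below shows one way such counts could be reproduced, namely the word n-grams that occur in both A and B but nowhere in the rest of the book. The tokenizer and the function names are assumptions made for illustration, and the sketch only does the counting; it does not by itself say whether the overlap is statistically significant.<br>

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """All distinct word n-grams of length n (empty if the text is too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exclusive_shared_ngrams(
    a: str, b: str, rest_segments: Iterable[str], n: int
) -> Set[NGram]:
    """n-grams found in both `a` and `b` but in none of `rest_segments`.

    `rest_segments` holds the remaining parts of the book as separate strings,
    so that no n-gram is formed across the gaps left by removing A and B.
    """
    shared = ngrams(tokenize(a), n) & ngrams(tokenize(b), n)
    for segment in rest_segments:
        shared -= ngrams(tokenize(segment), n)
    return shared
```

Note that a shared 9-gram also contains two shared 8-grams, three shared 7-grams, and so on, so when comparing against counts like those in the question one would probably want to keep only the longest matches.<br>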