[Corpora-List] Comparing n-grams / authorship

Yuval Marton yuvalmarton at gmail.com
Wed Apr 18 17:53:12 UTC 2012


A few years ago, there was some controversy regarding whether one could use
off-the-shelf compression techniques such as gzip for authorship
attribution.  If you are interested in trying out something like that, you
might want to have a look at

Yuval Marton, Ning Wu, and Lisa Hellerstein. "On Compression-Based Text
Classification". Proceedings of the 27th European Conference on Information
Retrieval (ECIR), Spain, March 2005.
Abstract: <http://www1.ccls.columbia.edu/%7Eymarton/pub/ecir05/finalforwebabs.pdf>
Full paper: <http://www.springerlink.com/index/M24DHW0XDMREHE64>
or <http://www1.ccls.columbia.edu/%7Eymarton/pub/ecir05/final.pdf>
Errata note: <http://www1.ccls.columbia.edu/%7Eymarton/pub/ecir05/onCompressionForClassificationErrata4.htm>

... and the papers cited within.

Pro: very easy to use. Simply split each chapter into "training" and "test"
parts, and follow the recipe in the paper.
Con: not the standard way authorship attribution tasks are usually
carried out.
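
To give a rough feel for the general idea, here is a minimal sketch in
Python. It is not the exact recipe from the paper; it only illustrates the
generic "append and compress" approach: append the test text to each
candidate's training text and see which candidate makes it cheapest to
encode. The function names (compressed_size, extra_bytes, attribute) are
made up for this illustration, and zlib is used as a stand-in for gzip
(both use DEFLATE).

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib (DEFLATE) compressed form of data."""
    return len(zlib.compress(data, 9))


def extra_bytes(training: str, test: str) -> int:
    """Bytes added to the compressed size when test is appended to training."""
    train = training.encode("utf-8")
    both = train + b"\n" + test.encode("utf-8")
    return compressed_size(both) - compressed_size(train)


def attribute(candidates: Mapping[str, str], test: str) -> str:
    """Return the label whose training text makes test cheapest to encode."""
    return min(candidates, key=lambda label: extra_bytes(candidates[label], test))
```

For example, attribute({"A": train_a, "B": train_b}, test_part) returns "A"
or "B", where train_a, train_b and test_part are placeholder strings you
would fill with your own chapter halves. One caveat: DEFLATE only looks
back 32 KB, so with long training texts only the last part of the training
text can actually help compress the test part.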


Yuval Marton
IBM Research



On Tue, Apr 17, 2012 at 3:47 PM, Mark Davies <Mark_Davies at byu.edu> wrote:

> I am sending the following question on behalf of a colleague at BYU.
> Thanks in advance for any suggestions you have; I'll forward them to the
> researcher who is working on this problem.
>
> Mark Davies, BYU
>
> -------------------------------------------
>
>
> I am working with a 250,000 word text. Within this text there are two
> chapters, A and B (1,200 and 2,400 words respectively). The authorship of
> these two chapters is unknown, but we have reason to believe that the
> author(s) of A and B have a relationship that is different from the
> majority of the rest of the book. There are two 4-grams, three 6-grams, one
> 7-gram, one 8-gram, and one 9-gram shared by chapters A and B
> that appear nowhere else in the book. Intuitively it seems like there is a
> unique relationship between chapters A and B.
>
> The question is:
>
> Is there a statistical method of measuring whether the types of n-grams
> above establish a reasonable probability that the two texts are linked?
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
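
For reference, the kind of count described in the quoted question (n-grams
shared by A and B that appear nowhere else in the book) could be reproduced
with something like the sketch below. It assumes the texts are already
tokenized into lists of words and that the rest of the book is passed as
separate segments, so that no n-gram is formed across the places where A
and B were cut out. The function names are made up for this illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Set of all word n-grams in a token list (empty if it is too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exclusive_shared_ngrams(
    a: List[str],
    b: List[str],
    rest_segments: Iterable[List[str]],
    n: int,
) -> Set[NGram]:
    """n-grams found in both a and b but in none of the other segments."""
    shared = ngrams(a, n) & ngrams(b, n)
    for segment in rest_segments:
        shared -= ngrams(segment, n)
    return shared
```

Note that a shared 5-gram always contains two shared 4-grams, so counts for
different values of n overlap and should not be treated as independent
evidence. Running the same count on other pairs of passages of similar
length might give a rough baseline for comparison, though that is only a
suggestion attached to this sketch, not something established in this
thread.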