[Corpora-List] Comparing n-grams / authorship

Justin Washtell lec3jrw at leeds.ac.uk
Tue Apr 17 21:17:36 UTC 2012


Hi Mark,

Statistics such as log-likelihood (see http://ucrel.lancs.ac.uk/llwizard.html) can give an indication of how significant the differences in observed frequencies of events are.
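
To make that concrete, here is a minimal sketch of the two-corpus log-likelihood (G2) calculation as it is usually described for corpus frequency comparison. The function name and the example figures are made up for illustration; this is not a reimplementation of the wizard linked above.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood (G2) for one item.

    freq_a, freq_b: observed frequencies of the item in each text.
    size_a, size_b: total token counts of each text.
    """
    total = size_a + size_b
    # Expected frequencies if the item were spread evenly by text size
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical figures: an item seen 20 times in one 1,000-word text
# and 5 times in another 1,000-word text.
score = log_likelihood(20, 1000, 5, 1000)
print(round(score, 2))
```

A score above 3.84 corresponds to p < 0.05 at one degree of freedom, but that threshold only has meaning under the null hypothesis the statistic assumes.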

These sorts of statistics assume a null hypothesis in which everything is entirely random or unrelated; anything that departs from it is considered "significant". You need to be careful with this. Often in reality, as in your case I think, what you are looking for is actually more subtle.

For example, I would suggest that you will at least want to look at similar n-gram statistics derived from all other pairwise combinations of chapters in your particular corpus, to establish whether what is observed between A and B is somehow "special" in your case.
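
One way to sketch that pairwise baseline: for every pair of chapters, count the n-grams that occur in exactly those two chapters and nowhere else, then see where the A/B pair falls in the resulting distribution. The function names and the toy chapters below are hypothetical, and tokenisation is assumed to have been done already.

```python
from collections import defaultdict
from itertools import combinations

def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def exclusive_shared_counts(chapters, n):
    """Map each chapter pair to the number of n-grams found in exactly
    those two chapters and in no other chapter.

    chapters: dict of chapter name -> list of tokens.
    """
    owners = defaultdict(set)  # n-gram -> names of chapters containing it
    for name, tokens in chapters.items():
        for gram in ngrams(tokens, n):
            owners[gram].add(name)
    counts = {pair: 0 for pair in combinations(sorted(chapters), 2)}
    for names in owners.values():
        if len(names) == 2:
            counts[tuple(sorted(names))] += 1
    return counts

# Toy data only; a real run would use every chapter of the book.
chapters = {
    "A": "the cat sat on the mat".split(),
    "B": "a cat sat on a rug".split(),
    "C": "dogs bark at night".split(),
}
print(exclusive_shared_counts(chapters, 3))
```

Raw counts favour longer chapters, so some normalisation by chapter length is probably needed before comparing pairs; how best to do that is left open here.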

Also, I imagine the observed frequencies of those lower order n-grams which constitute your longer n-grams will have a bearing on how remarkable the figures are before you even start looking at the relative differences. For getting a handle on that, the language modelling literature may be useful.
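
As a rough illustration of that point, the sketch below scores a longer n-gram from the frequencies of its parts, using a bigram model with add-one smoothing. This is just one simple choice from the language modelling literature, picked here as an assumption; the function name and toy text are hypothetical.

```python
import math
from collections import Counter

def bigram_log_prob(ngram, tokens):
    """Log-probability of an n-gram under an add-one-smoothed bigram
    model estimated from `tokens` (e.g. the whole book)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    # Smoothed probability of the first word
    logp = math.log((unigrams[ngram[0]] + 1) / (len(tokens) + vocab))
    # Chain rule: multiply in P(word | previous word) for each step
    for prev, word in zip(ngram, ngram[1:]):
        logp += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return logp

# Toy text only: "a b" is a frequent sequence, "b c" a rare one.
tokens = "a b a b a b c".split()
print(bigram_log_prob(("a", "b"), tokens))
print(bigram_log_prob(("b", "c"), tokens))
```

A long n-gram built from very common parts gets a comparatively high score, so sharing it is less remarkable than sharing one built from rare parts.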

Sorry I cannot be more specific. I'm not a statistician :-)

Justin Washtell
University of Leeds


________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Yorick Wilks [Y.Wilks at dcs.shef.ac.uk]
Sent: 17 April 2012 21:03
To: Mark Davies
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Comparing n-grams / authorship

The questioner might want to look at the METER project: http://aclantho3.herokuapp.com/catalog/P02-1020
This was an attempt to determine if one text had been rewritten from another based on n-grams, in a journalism and press service context (rather than plagiarism). It turned out that such texts could have very long n-grams in common without having been rewritten from each other.
Yorick Wilks


On 17 Apr 2012, at 15:47, Mark Davies wrote:

> I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.
>
> Mark Davies, BYU
>
> -------------------------------------------
>
>
> I am working with a 250,000 word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared by chapters A and B that appear nowhere else in the book. Intuitively it seems like there is a unique relationship between chapters A and B.
>
> The question is:
>
> Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked?
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora





