[Corpora-List] Comparing n-grams / authorship

Alberto Barrón-Cedeño lbarron at dsic.upv.es
Wed Apr 18 09:02:50 UTC 2012


Dear Mark,

From the numbers you mention ({6,7,8,9}-grams in common), it is very
likely that the book chapters have a co-derivation relationship: either
one of them was consulted when producing the other, or both drew on a
common source.

You could first look at this from the point of view of forensic
linguistics. [1] observes that "the longer a phrase, the less likely you
are going to find anybody use it". Experts estimate that (assuming circa
40% of the words in a text are lexical) two documents on the same topic
could share around 25% of their lexical words; but if two documents have
circa 60% of their lexical words in common, they can be considered
related [2]. Obviously, in this case we are talking about 1-grams; for
higher-order n-grams the expected number of shared terms is much lower.
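If you want a quick estimate of that 1-gram figure, something along
these lines would do (a minimal sketch; the stopword list, the tokeniser
and the normalisation below are simplistic placeholders of mine, not the
procedure behind the figures in [2]):

  import re

  STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to",
               "is", "are", "was", "were", "be", "that", "this", "it"}

  def lexical_words(text):
      # crude tokeniser: lowercase alphabetic tokens, stopwords removed
      tokens = re.findall(r"[a-z']+", text.lower())
      return set(t for t in tokens if t not in STOPWORDS)

  def lexical_overlap(doc_a, doc_b):
      # fraction of the smaller document's lexical vocabulary shared
      a, b = lexical_words(doc_a), lexical_words(doc_b)
      if not a or not b:
          return 0.0
      return float(len(a & b)) / min(len(a), len(b))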

This brings us to the concept of "uniqueness": every person is
linguistically unique; no two people express their ideas in exactly the
same way [3]. Inspired by some slides presented by M. Coulthard and
M.T. Turell at PAN 2011 (see below), I tried a simple "uniqueness"
experiment. I took a set of phrases and split each of them into n-grams
of increasing order (0 < n < 14). Each resulting chunk was quoted and
submitted as a query to a commercial search engine. I attach the results
(don't worry about the different colours; consider all of them randomly
selected phrases): already from n = 6, it is extremely unlikely that a
given word sequence will occur in two presumably independent documents.
You could try the same exercise with the fragments you mention.
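In case you want to replicate the chunking step, here is a minimal
sketch (the function names are mine; the querying itself has to go
through whatever interface your search engine offers):

  def word_ngrams(tokens, n):
      # contiguous word n-grams over a token list
      return [" ".join(tokens[i:i + n])
              for i in range(len(tokens) - n + 1)]

  def quoted_queries(phrase, max_n=13):
      # one quoted query per n-gram, for n = 1 .. 13 (i.e. 0 < n < 14)
      tokens = phrase.split()
      queries = []
      for n in range(1, min(max_n, len(tokens)) + 1):
          queries.extend('"%s"' % g for g in word_ngrams(tokens, n))
      return queries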

Now, what about two documents written by a single author? Table 1 in
[4] shows a toy experiment we carried out on four documents written by
the same authors: on average, only 3% of the 4-grams occurred in both
documents (versus 16% of 1-grams and 11% of 2-grams). Note that these
are documents on the same topic, by the same authors.
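The kind of proportion I mean can be computed along these lines (a
sketch over n-gram types, normalised by the smaller set; the exact
measure in [4] may differ):

  def shared_ngram_ratio(doc_a, doc_b, n):
      # proportion of n-gram types the two documents have in common
      def grams(text):
          toks = text.lower().split()
          return set(tuple(toks[i:i + n])
                     for i in range(len(toks) - n + 1))
      a, b = grams(doc_a), grams(doc_b)
      if not a or not b:
          return 0.0
      return float(len(a & b)) / min(len(a), len(b))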

You or your colleague might be interested in the PAN initiative
(http://pan.webis.de), which includes automatic plagiarism detection and
authorship identification tasks, among others. You can get an overview
of the different models applied to these tasks from the previous
editions of the lab (everything is available online). The Coulthard and
Turell slides I mentioned are available from the 2011 edition site
(PAN @ CLEF'11), accessible from the same PAN website.

[1] Coulthard, M. (2004). Author Identification, Idiolect, and
Linguistic Uniqueness. Applied Linguistics, 25(4), 431-447.
[2] Coulthard, M. (2010). The Linguist as Detective: Forensic
Applications of Language Description. Talk at: Jornadas (In)formativas
de Lingüística Forense ((In)formative Conference on Forensic
Linguistics), Madrid, Spain. [http://bit.ly/madrid_lingforense]
[3] Coulthard, M. and Johnson, A. (2007). An Introduction to Forensic
Linguistics: Language in Evidence. Routledge, Oxon, UK.
[4] Barrón-Cedeño, A. and Rosso, P. (2009). On Automatic Plagiarism
Detection Based on n-grams Comparison. In: Boughanem et al. (Eds.), ECIR
2009, LNCS 5478, pp. 696-700. Springer-Verlag, Berlin Heidelberg.

Kind regards,
Alberto

-- 
Alberto Barrón-Cedeño 
Department of Information Systems and Computation (Ph.D. student)
Universidad Politécnica de Valencia
http://www.dsic.upv.es/~lbarron


On Tue, 2012-04-17 at 19:47 +0000, Mark Davies wrote:
> I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.
> 
> Mark Davies, BYU
> 
> -------------------------------------------
> 
> 
> I am working with a 250,000-word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared by chapters A and B that appear nowhere else in the book. Intuitively, it seems that there is a unique relationship between chapters A and B.
> 
> The question is:
> 
> Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: uniqueness_example.png
Type: image/png
Size: 68197 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20120418/f48448ce/attachment.png>