[Corpora-List] Re: near duplicate detection

Linda Bawcom linda.bawcom at sbcglobal.net
Sat Jun 4 01:21:43 UTC 2005


Dear Colleagues and List Members,

Marco Baroni very kindly (and patiently, I might add) tried to help me with a duplicate detection tool and happened to mention that someone had just written to the list about a plagiarism detection tool that might help with my research (he didn't keep the message, just the site address). I did not get that message, but then my internet connection here at the dorms in Liverpool is very cranky. If anyone else answered my e-mail to Nancy Ide regarding a duplicate detection tool and I did not reply, I apologize. Again, I can only blame my connection, which comes and goes for several seconds at its leisure.

In any case, the web site again is:  http://plagiarism.phys.virginia.edu/Wsoftware.html

I tried it and it worked brilliantly! I had 73 newspaper articles in separate text files and almost instantaneously it detected all duplicate sentences (I did change the character settings and the number of words that could be skipped, to allow for paraphrasing of some kind, for example). And even more fortunately, it discovered, much to my chagrin, that 3 of my articles were exactly the same (with the same title). What is even more helpful for my research is that I can see how the text was edited and where similonyms were used (e.g. "tsunami" in one text became "tidal wave" in another even though the rest of the paragraph was exactly the same). As it only lists texts two at a time (although that can be a long list of pairs), I still need to do some cross-referencing. But it has saved me considerable time. If you are interested, of the 70 texts, 8 pairs had duplicated text, ranging in duplication from 16% to 93%.
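
For anyone who would like to try the general idea on their own files, here is a minimal sketch in Python. It is not how the tool at the site above works (I do not know its internals); it simply assumes that a "duplicate" is a sentence that is identical after lowercasing and whitespace cleanup, and it reports each pair of texts with the percentage of shared sentences. The function names and the 10% threshold are my own illustrative choices, and unlike the tool it makes no allowance for skipped or substituted words.

```python
import re
from itertools import combinations


def sentences(text):
    """Split text into a set of normalized sentences.

    Normalization here is only lowercasing and collapsing whitespace,
    which is an assumption of this sketch.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return {" ".join(p.lower().split()) for p in parts if p.strip()}


def shared_percent(a, b):
    """Percentage of the shorter text's sentences that also occur in the other."""
    sa, sb = sentences(a), sentences(b)
    if not sa or not sb:
        return 0.0
    return 100.0 * len(sa & sb) / min(len(sa), len(sb))


def duplicate_pairs(texts, threshold=10.0):
    """Compare texts two at a time; texts maps a file name to its contents.

    Returns (name1, name2, percent) for every pair whose shared-sentence
    percentage is at or above the (hypothetical) threshold.
    """
    pairs = []
    for (n1, t1), (n2, t2) in combinations(sorted(texts.items()), 2):
        pct = shared_percent(t1, t2)
        if pct >= threshold:
            pairs.append((n1, n2, round(pct, 1)))
    return pairs
```

Given two three-sentence texts that differ only in "tsunami" versus "tidal wave" in the first sentence, this would report the pair as sharing two of three sentences. Because it compares whole sentences exactly, a sentence with a single word changed counts as entirely different, which is precisely where a tool with adjustable skip settings does better.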

Therefore, I would very much like to thank the author of this tool (if you are out there, please let me know), not only because it works so perfectly, but also for her or his generosity in sharing it for free (perhaps that's only at the moment, but it's still very much appreciated).

Kindest regards,
Linda Bawcom

