<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>


<head>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">


<meta name=Generator content="Microsoft Word 10 (filtered)">


<style>

<!--

 /* Font Definitions */

 @font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

 /* Style Definitions */

 p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman";}

a:link, span.MsoHyperlink

        {color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {color:blue;

        text-decoration:underline;}

span.EmailStyle17

        {font-family:Arial;

        color:navy;}

@page Section1

        {size:612.0pt 792.0pt;

        margin:72.0pt 90.0pt 72.0pt 90.0pt;}

div.Section1

        {page:Section1;}

-->

</style>


</head>


<body lang=EN-US link=blue vlink=blue>


<div class=Section1>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'>We recently encountered this problem with

the LDC’s English Gigaword corpus: many of the stories in this newswire

corpus occur repeatedly, with changing datelines, often in updated and revised

forms.  We have also hit the question when producing corpora for dictionary-making

from the web.  </span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'> </span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'>A crucial question in these situations is:

what are the objects which might be considered duplicates?  If two stories

share two paragraphs, but each has two further paragraphs that are not shared,

it is not obvious what should be done.  Our solution (working with Infogistics

Ltd, from </span></font><font size=2 color=navy face=Arial><span

  style='font-size:10.0pt;font-family:Arial;color:navy'>Edinburgh</span></font><font

size=2 color=navy face=Arial><span style='font-size:10.0pt;font-family:Arial;

color:navy'>) heuristically identified ‘paragraphs’ and treated

them as the objects which might be duplicates.  It also looked at successions

of paragraphs because, firstly, identical short paragraphs may have been

produced independently on two or more occasions, and secondly, stripping out

paragraphs destroys the integrity of the text, so we did not want to do it lightly.</span></font></p>
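
<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'>&nbsp;</span></font></p>

<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'>As a minimal sketch of the paragraph-level idea
(not the Infogistics implementation, whose details are not given here), the Python
below splits each text on blank lines, fingerprints each normalised paragraph, and
removes a paragraph only when it sits in a run of consecutive paragraphs that have
all been seen before. The function names, the blank-line splitting rule and the
min_run threshold are all assumptions made for illustration.</span></font></p>

<pre>
```python
import hashlib
import re


def split_paragraphs(text):
    """Split on blank lines; a stand-in for heuristic paragraph detection."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def fingerprint(paragraph):
    """Hash a paragraph after lowercasing and collapsing whitespace."""
    normalised = " ".join(paragraph.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def duplicate_runs(paragraphs, seen, min_run=2):
    """Return indices of paragraphs that lie in a run of at least min_run
    consecutive paragraphs whose fingerprints are already in seen."""
    flagged = []
    run = []
    for i, para in enumerate(paragraphs):
        if fingerprint(para) in seen:
            run.append(i)
            continue
        # The run of previously seen paragraphs has just ended.
        if len(run) >= min_run:
            flagged.extend(run)
        run = []
    if len(run) >= min_run:
        flagged.extend(run)
    return flagged


def deduplicate(documents, min_run=2):
    """Process documents in order, dropping runs of already-seen paragraphs.
    Returns one list of kept paragraphs per document."""
    seen = set()
    cleaned = []
    for doc in documents:
        paragraphs = split_paragraphs(doc)
        drop = set(duplicate_runs(paragraphs, seen, min_run))
        cleaned.append([p for i, p in enumerate(paragraphs) if i not in drop])
        # Record every paragraph, kept or not, for later documents.
        seen.update(fingerprint(p) for p in paragraphs)
    return cleaned
```
</pre>

<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'>With min_run=2, a single repeated short
paragraph (which may have been produced independently) is kept, while two or more
repeated paragraphs in succession are dropped.</span></font></p>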


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'> </span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'>I think one set of papers mentioned in earlier

responses to the query, which uses document similarity, won’t help in our

scenario, but another, which looks for longest common substrings (see Alexander

Clark’s mail), will.</span></font></p>
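
<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'>&nbsp;</span></font></p>

<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'>To illustrate what a longest-common-substring
comparison measures, here is a small Python sketch using the textbook dynamic
programming approach over word tokens. It is not the method of the papers referred
to, which are not described here; the function name and the choice of whitespace
tokenisation are assumptions. Its cost grows with the product of the two text
lengths, so a collection of a million texts would presumably need an index-based
technique such as suffix arrays instead.</span></font></p>

<pre>
```python
def longest_common_run(a, b):
    """Return the longest run of consecutive words shared by texts a and b,
    as a list of words (empty if the texts share no word)."""
    x, y = a.split(), b.split()
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at x[i-2], y[j-1].
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]


shared = longest_common_run(
    "the minister said on Tuesday that talks would resume",
    "Officials confirmed that talks would resume next week",
)
print(len(shared), " ".join(shared))  # prints: 4 that talks would resume
```
</pre>

<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'>The length of the shared run could then be
compared with a threshold chosen for the task at hand.</span></font></p>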


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'> </span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'>The interesting theoretical question lurking

around here is: when does a common expression (essential subject matter for

corpus linguistics) turn into duplication (which is not wanted)?  Repetition

of the former kind is the fabric of language.  If I speak in formulae and clichés,

as so many of us do so much of the time, it is likely that my speaker turns will

exactly match others’. Quotations are another intermediate case –

if someone quotes half a sentence from a text that is also in the corpus, you

want to leave it in.  If it is a couple of sentences – maybe.  If it is a

couple of paragraphs or more you may well want to throw it out as duplication. 

My suspicion is that it will always depend on what you want to do with the corpus.</span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'> </span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'>Adam Kilgarriff</span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'>Lexical Computing Ltd</span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'> </span></font></p>


<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:

10.0pt;font-family:Arial;color:navy'> </span></font></p>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Tahoma><span

style='font-size:10.0pt;font-family:Tahoma'>-----Original Message-----<br>

<b><span style='font-weight:bold'>From:</span></b> owner-corpora@lists.uib.no

[mailto:owner-corpora@lists.uib.no] <b><span style='font-weight:bold'>On Behalf

Of </span></b>Ralf Steinberger<br>

<b><span style='font-weight:bold'>Sent:</span></b> </span></font><font size=2 face=Tahoma><span style='font-size:10.0pt;font-family:Tahoma'>22 December

 2004</span></font><font size=2 face=Tahoma><span style='font-size:10.0pt;

font-family:Tahoma'> </span></font><font size=2 face=Tahoma><span

 style='font-size:10.0pt;font-family:Tahoma'>16:46</span></font><font size=2

face=Tahoma><span style='font-size:10.0pt;font-family:Tahoma'><br>

<b><span style='font-weight:bold'>To:</span></b> List Corpora (Corpora list)<br>

<b><span style='font-weight:bold'>Subject:</span></b> [Corpora-List] Q: How to

identify duplicates in a large document collection</span></font></p>


<p class=MsoNormal style='margin-left:36.0pt'><font size=3

face="Times New Roman"><span style='font-size:12.0pt'> </span></font></p>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'>We are facing the task of having to

find duplicate and near-duplicate documents in a collection of about 1 million

texts. Can anyone give us advice on how to approach this challenge? </span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=3

face="Times New Roman"><span style='font-size:12.0pt'> </span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'>The documents are in various

formats (HTML, PDF, MS-Word, plain text, ...), so we intend

to first convert them to plain text. It is possible that the same text is

present in the document collection in different formats.</span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=3

face="Times New Roman"><span style='font-size:12.0pt'> </span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'>For smaller collections, we

identify (near-)duplicates by applying hierarchical clustering techniques, but

with this approach, we are limited to a few thousand documents. </span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=3

face="Times New Roman"><span style='font-size:12.0pt'> </span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'>Any pointers are welcome. Thank you.</span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=3

face="Times New Roman"><span style='font-size:12.0pt'> </span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'>Ralf Steinberger</span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'>European Commission - Joint Research

Centre</span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=2 face=Arial><span

style='font-size:10.0pt;font-family:Arial'><a href="http://www.jrc.it/langtech">http://www.jrc.it/langtech</a></span></font></p>


</div>


<div>


<p class=MsoNormal style='margin-left:36.0pt'><font size=3

face="Times New Roman"><span style='font-size:12.0pt'> </span></font></p>


</div>


</div>


</body>


</html>