or something like this.   <a href="http://www.lextutor.ca/text_lex_compare/">http://www.lextutor.ca/text_lex_compare/</a><div><br></div><div>You enter two texts and the software will indicate overlapping words as well as unique words. </div>
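<div><br></div><div>If you would rather script it than paste texts into a web form, the same idea can be sketched in a few lines of Python. This is only a rough illustration and not how the lextutor tool works internally: the function names (tokenize, compare_texts, new_words_per_text) and the simple regex tokenizer are my own assumptions, and a real corpus would probably need a proper tokenizer and perhaps lemmatization.</div>

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word types (crude regex tokenizer)."""
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def compare_texts(first, second):
    """Return the words shared by both texts and the words unique to each."""
    a, b = tokenize(first), tokenize(second)
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


def new_words_per_text(texts):
    """For every two consecutive texts, list the words that are new in the later one."""
    return [
        compare_texts(previous, current)["only_second"]
        for previous, current in zip(texts, texts[1:])
    ]
```

<div>Calling new_words_per_text on a list of texts in corpus order gives one list of new words per consecutive pair, which is roughly what the original question asks for. Note that "new" here means absent from the immediately preceding text only, not from all earlier texts.</div>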

<div><br></div><div>B</div><br><div class="gmail_quote">On Thu, Jul 19, 2012 at 4:11 PM, Martin Reynaert <span dir="ltr"><<a href="mailto:reynaert@uvt.nl" target="_blank">reynaert@uvt.nl</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Dear Amanda,<br>

<br>

I have a sneaky feeling you may be interested in what is called<br>

`vocabulary growth curves'. In which case: do check out:<br>

<a href="http://zipfr.r-forge.r-project.org/" target="_blank">http://zipfr.r-forge.r-project.org/</a> .<br>

<br>

If that proves to be too much all of a sudden, do check out:<br>

<br>

H. Baayen (2001). Word frequency distributions. Kluwer, Dordrecht.<br>

<br>

You will want to go on to his more recent publications after that ;0)<br>

<br>

Best,<br>

<br>

Martin<br>

<div class="HOEnZb"><div class="h5"><br>

On 07/20/2012 12:43 AM, Amanda wrote:<br>

> Dear all,<br>

><br>

>     Does anyone know of any existing (and available) software that can<br>

> automatically:<br>

><br>

>     1. Compare every two consecutive texts in a corpus; and<br>

>     2. List every new word that occurs in the latter text?<br>

><br>

>     Or do you know any papers about that?<br>

><br>

>     Thank you for your help!<br>

><br>

> All the best.<br>

> Amanda<br>

><br>

> -----Original Message-----<br>

> From: <a href="mailto:corpora-bounces@uib.no">corpora-bounces@uib.no</a> [mailto:<a href="mailto:corpora-bounces@uib.no">corpora-bounces@uib.no</a>] On Behalf Of<br>

> <a href="mailto:corpora-request@uib.no">corpora-request@uib.no</a><br>

> Sent: July 19, 2012 11:00<br>

> To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>

> Subject: Corpora Digest, Vol 61, Issue 17<br>

><br>

> Today's Topics:<br>

><br>

>    1.  Seeking corpus for academic domain (Lushan Han)<br>

>    2. Re:  Seeking corpus for academic domain (Lushan Han)<br>

>    3.  English confusables (Carter, Simon)<br>

><br>

><br>

> ----------------------------------------------------------------------<br>

><br>

> Message: 1<br>

> Date: Wed, 18 Jul 2012 11:03:56 -0400<br>

> From: Lushan Han <<a href="mailto:lushan1@umbc.edu">lushan1@umbc.edu</a>><br>

> Subject: [Corpora-List] Seeking corpus for academic domain<br>

> To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>

><br>

> Dear all,<br>

><br>

> I am looking for a very large corpus (> 1 billion words) for the academic<br>

> domain, mainly describing universities, projects, conferences, papers, authors,<br>

> etc. I will compute statistics from it, to be used in building a query<br>

> system on structured data for the academic domain.<br>

><br>

> Does anyone know such a corpus? Any information will be appreciated.<br>

><br>

><br>

> Thanks,<br>

><br>

> Lushan Han<br>

><br>

> ------------------------------<br>

><br>

> Message: 2<br>

> Date: Wed, 18 Jul 2012 15:30:20 -0400<br>

> From: Lushan Han <<a href="mailto:lushan1@umbc.edu">lushan1@umbc.edu</a>><br>

> Subject: Re: [Corpora-List] Seeking corpus for academic domain<br>

> To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>

><br>

> A smaller corpus (e.g. millions of words) would also be very helpful<br>

> to me.  Please let me know if you happen to know of one.<br>

><br>

> Thanks,<br>

><br>

> Lushan<br>

><br>

> On Wed, Jul 18, 2012 at 11:03 AM, Lushan Han <<a href="mailto:lushan1@umbc.edu">lushan1@umbc.edu</a>> wrote:<br>

><br>

>> Dear all,<br>

>><br>

>> I am looking for a very large corpus (> 1 billion words) for the<br>

>> academic domain, mainly describing universities, projects, conferences,<br>

>> papers, authors, etc. I will compute statistics from it, to be<br>

>> used in building a query system on structured data for the academic domain.<br>

>><br>

>> Does anyone know such a corpus? Any information will be appreciated.<br>

>><br>

>><br>

>> Thanks,<br>

>><br>

>> Lushan Han<br>

>><br>

><br>

> ------------------------------<br>

><br>

> Message: 3<br>

> Date: Thu, 19 Jul 2012 09:28:03 +0000<br>

> From: "Carter, Simon" <<a href="mailto:S.C.Carter@uva.nl">S.C.Carter@uva.nl</a>><br>

> Subject: [Corpora-List] English confusables<br>

> To: "<a href="mailto:corpora@uib.no">corpora@uib.no</a>" <<a href="mailto:corpora@uib.no">corpora@uib.no</a>><br>

><br>

> Dear Corpora List,<br>

><br>

> I was wondering if there is a list of English confusables somewhere?<br>

><br>

> Thanks,<br>

><br>

> Simon<br>

><br>

> ----------------------------------------------------------------------<br>

> Send Corpora mailing list submissions to<br>

>       <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>

><br>

> To subscribe or unsubscribe via the World Wide Web, visit<br>

>       <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

> or, via email, send a message with subject or body 'help' to<br>

>       <a href="mailto:corpora-request@uib.no">corpora-request@uib.no</a><br>

><br>

> You can reach the person managing the list at<br>

>       <a href="mailto:corpora-owner@uib.no">corpora-owner@uib.no</a><br>

><br>

> When replying, please edit your Subject line so it is more specific than<br>

> "Re: Contents of Corpora digest..."<br>

><br>

><br>

> _______________________________________________<br>

> Corpora mailing list<br>

> <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

><br>

><br>

> End of Corpora Digest, Vol 61, Issue 17<br>

> ***************************************<br>

><br>

><br>

> _______________________________________________<br>

> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

> Corpora mailing list<br>

> <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ulugbek Nurmukhamedov, <div>Northern Arizona University, </div><div>GSAAL page - <a href="http://www.cal.nau.edu/gsaal/" target="_blank">http://www.cal.nau.edu/gsaal/</a> </div>

<div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse;color:rgb(136,136,136)"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse;color:rgb(136,136,136)">Be not content with stories of those who went before you. Go forth and create your own story (Mawlana al-Rumi)</span></div>

<br>