<div>Sandra,</div><div><br></div><div>Su Nam Kim has collected a number of popular keyphrase extraction corpora from a variety of papers and theses and hosts them on <a href="https://github.com/snkim/AutomaticKeyphraseExtraction">GitHub</a>.</div>

<div><br></div><div>Chris</div><br><br><div class="gmail_quote">On Mon, Jan 17, 2011 at 9:26 AM, Diana Inkpen <span dir="ltr">&lt;<a href="mailto:diana@site.uottawa.ca">diana@site.uottawa.ca</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

  <div text="#000000" bgcolor="#ffffff">

    <br>

    <br>

    -------- Original Message --------

    <table border="0" cellpadding="0" cellspacing="0">

      <tbody>

        <tr>

          <th valign="BASELINE" align="RIGHT" nowrap>Subject: </th>

          <td>Re: [Corpora-List] A corpus to evaluate Keyword Extraction

            techniques</td>

        </tr>

        <tr>

          <th valign="BASELINE" align="RIGHT" nowrap>Date: </th>

          <td>Mon, 17 Jan 2011 10:29:59 +0000</td>

        </tr>

        <tr>

          <th valign="BASELINE" align="RIGHT" nowrap>From: </th>

          <td>Alexander Schutz

            <a href="mailto:goalscoringsuperstarhero@gmail.com" target="_blank">&lt;goalscoringsuperstarhero@gmail.com&gt;</a></td>

        </tr>

        <tr>

          <th valign="BASELINE" align="RIGHT" nowrap>To: </th>

          <td>Sandra Garcia Blasco <a href="mailto:sgarcia@dsic.upv.es" target="_blank">&lt;sgarcia@dsic.upv.es&gt;</a></td>

        </tr>

        <tr>

          <th valign="BASELINE" align="RIGHT" nowrap>CC: </th>

          <td><a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a></td>

        </tr>

      </tbody>

    </table>

    <br>

    <br>

    <pre>Apologies,

it appears the PubMed URL has changed; I should have checked before sending.

Now, [1] includes a number of links to downloadable articles, in the section

"XML for data mining via FTP".

[1] <a href="http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html" target="_blank">http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html</a>

Kind regards,

Alex

On Mon, Jan 17, 2011 at 10:26 AM, Alexander Schutz

<div><div></div><div class="h5"><a href="mailto:goalscoringsuperstarhero@gmail.com" target="_blank">&lt;goalscoringsuperstarhero@gmail.com&gt;</a> wrote:

> Sandra,

>

> a dataset resulting from my master's thesis, 'Keyphrase Extraction

> from Single Documents in the Open Domain Exploiting Linguistic and

> Statistical Methods' [1] is available at [2].

>

> It was based on the PubMed dataset available for download [3], which

> already contains keyphrases for documents.

> My dataset basically contains a reference back to the original PubMed

> article via pmcid, the originally assigned keyphrases (gold standard),

> the keyphrases assigned by my approach including confidence, some

> indications as to which sort of match between gold standard and

> approach has occurred, and some document statistics. This is all on a

> per-document basis, covering 1323 documents from the original PubMed

> dataset (80k or so docs).

>

> In case you do not have time to read the full thesis, the procedure

> is summarised in [4] and subsequent pages.

> To gain a proper understanding of how this dataset was produced, it is

> at least necessary to read and understand [5], or the evaluation

> chapter of the thesis.

>

> Happy extracting.

> Alex

>

> P.S. There is also a dataset for qualitative evaluation results,

> however as this comprised keyphrases from user-specified content, I

> suspect this is not useful for anyone else.

>

> P.P.S. If you have questions, don't hesitate to give me a shout.

>

> [1] <a href="http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf" target="_blank">http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf</a>

> [2] <a href="http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip" target="_blank">http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip</a>

> [3] <a href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz" target="_blank">ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz</a>

> [4] <a href="http://smile.deri.ie/projects/keyphrase-extraction" target="_blank">http://smile.deri.ie/projects/keyphrase-extraction</a>

> [5] <a href="http://smile.deri.ie/node/204" target="_blank">http://smile.deri.ie/node/204</a>

>

> On Mon, Jan 17, 2011 at 8:29 AM, Sandra Garcia Blasco

> <a href="mailto:sgarcia@dsic.upv.es" target="_blank">&lt;sgarcia@dsic.upv.es&gt;</a> wrote:

>> Dear all,

>>

>> We are interested in evaluating our method for Keyword Extraction, but we are

>> having a hard time finding a corpus to evaluate it on. Do any of you know of

>> an available corpus of texts with associated keywords?

>>

>> Thank you very much for your help,

>>

>>

>> Sandra Garcia --

>>

>> Universitat Politécnica de Valencia

>>

>> _______________________________________________

>> Corpora mailing list

>> <a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a>

>> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a>

>>

>>

>


</div></div></pre>

  </div>

</blockquote></div><br>