<div>Sandra,</div><div><br></div><div>Su Nam Kim has collected a number of popular keyphrase extraction corpora from a variety of papers and theses and hosts them on <a href="https://github.com/snkim/AutomaticKeyphraseExtraction">GitHub</a>.</div>
<div><br></div><div>Chris</div><br><br><div class="gmail_quote">On Mon, Jan 17, 2011 at 9:26 AM, Diana Inkpen <span dir="ltr">&lt;<a href="mailto:diana@site.uottawa.ca">diana@site.uottawa.ca</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div text="#000000" bgcolor="#ffffff">
<br>
<br>
-------- Original Message --------
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th valign="BASELINE" align="RIGHT" nowrap>Subject: </th>
<td>Re: [Corpora-List] A corpus to evaluate Keyword Extraction
techniques</td>
</tr>
<tr>
<th valign="BASELINE" align="RIGHT" nowrap>Date: </th>
<td>Mon, 17 Jan 2011 10:29:59 +0000</td>
</tr>
<tr>
<th valign="BASELINE" align="RIGHT" nowrap>From: </th>
<td>Alexander Schutz
<a href="mailto:goalscoringsuperstarhero@gmail.com" target="_blank">&lt;goalscoringsuperstarhero@gmail.com&gt;</a></td>
</tr>
<tr>
<th valign="BASELINE" align="RIGHT" nowrap>To: </th>
<td>Sandra Garcia Blasco <a href="mailto:sgarcia@dsic.upv.es" target="_blank">&lt;sgarcia@dsic.upv.es&gt;</a></td>
</tr>
<tr>
<th valign="BASELINE" align="RIGHT" nowrap>CC: </th>
<td><a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a></td>
</tr>
</tbody>
</table>
<br>
<br>
<pre>Apologies,
it appears the PubMed URL has changed; I should have checked before sending.
Now, [1] includes a number of links to downloadable articles, in the section
"XML for data mining via FTP".
[1] <a href="http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html" target="_blank">http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html</a>
Kind regards,
Alex
On Mon, Jan 17, 2011 at 10:26 AM, Alexander Schutz
<div><div></div><div class="h5"><a href="mailto:goalscoringsuperstarhero@gmail.com" target="_blank">&lt;goalscoringsuperstarhero@gmail.com&gt;</a> wrote:
> Sandra,
>
> a dataset resulting from my master's thesis, 'Keyphrase Extraction
> from Single Documents in the Open Domain Exploiting Linguistic and
> Statistical Methods' [1] is available at [2].
>
> It was based on the PubMed dataset available for download [3], which
> already contains keyphrases for documents.
> My dataset basically contains a reference back to the original PubMed
> article via pmcid, the originally assigned keyphrases (gold standard),
> the keyphrases assigned by my approach including confidence, some
> indications as to which sort of match between gold standard and
> approach has occurred, and some document statistics. This is all on a
> per-document basis, covering 1323 documents from the original PubMed
> dataset (80k or so docs).
>
> In case you do not have time to read the full thesis, the procedure
> is summarised in [4] and subsequent pages.
> To gain a proper understanding of how this dataset was produced, it is
> at least necessary to read and understand [5], or the evaluation
> chapter of the thesis.
>
> Happy extracting.
> Alex
>
> P.S. There is also a dataset for qualitative evaluation results,
> however as this comprised keyphrases from user-specified content, I
> suspect this is not useful for anyone else.
>
> P.P.S. If you have questions, don't hesitate to gimme a shout.
>
> [1] <a href="http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf" target="_blank">http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf</a>
> [2] <a href="http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip" target="_blank">http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip</a>
> [3] <a href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz" target="_blank">ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz</a>
> [4] <a href="http://smile.deri.ie/projects/keyphrase-extraction" target="_blank">http://smile.deri.ie/projects/keyphrase-extraction</a>
> [5] <a href="http://smile.deri.ie/node/204" target="_blank">http://smile.deri.ie/node/204</a>
>
> On Mon, Jan 17, 2011 at 8:29 AM, Sandra Garcia Blasco
> <a href="mailto:sgarcia@dsic.upv.es" target="_blank">&lt;sgarcia@dsic.upv.es&gt;</a> wrote:
>> Dear all,
>>
>> We are interested in evaluating our method for Keyword Extraction, but we are
>> having a hard time finding a corpus to evaluate it on. Do any of you know of
>> an available corpus of texts with related keywords?
>>
>> Thank you very much for your help,
>>
>>
>> Sandra Garcia --
>>
>> Universitat Politécnica de Valencia
>>
>> _______________________________________________
>> Corpora mailing list
>> <a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a>
>> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a>
>>
>>
>
</div></div></pre>
</div>
</blockquote></div><br>