[Corpora-List] A corpus to evaluate Keyword Extraction techniques

Alexander Schutz goalscoringsuperstarhero at gmail.com
Mon Jan 17 10:26:24 UTC 2011


Sandra,

a dataset resulting from my master's thesis, 'Keyphrase Extraction
from Single Documents in the Open Domain Exploiting Linguistic and
Statistical Methods' [1] is available at [2].

It was based on the PubMed dataset available for download [3], which
already contains keyphrases for documents.
My dataset basically contains a reference back to the original PubMed
article via pmcid, the originally assigned keyphrases (gold standard),
the keyphrases assigned by my approach including confidence, some
indications as to which sort of match between gold standard and
approach has occurred, and some document statistics. This is all on a
per-document basis, covering 1323 documents from the original PubMed
dataset (80k or so docs).
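
As a rough illustration of how per-document records of this kind can be
scored, here is a minimal Python sketch of exact-match precision, recall
and F1 between gold-standard and extracted keyphrases. It is not the
evaluation procedure from the thesis, and it does not read the actual
files in [2]: the DocRecord layout, the field names, the normalise()
helper and the min_confidence cut-off are all assumptions made up for
this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DocRecord:
    """Hypothetical per-document record; the real file layout may differ."""
    pmcid: str
    gold: list[str]                     # originally assigned keyphrases
    predicted: list[tuple[str, float]]  # (keyphrase, confidence) pairs


def normalise(phrase: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(phrase.lower().split())


def exact_match_scores(record: DocRecord, min_confidence: float = 0.0):
    """Return (precision, recall, f1) for one document, exact matches only."""
    gold = {normalise(p) for p in record.gold}
    pred = {
        normalise(p)
        for p, confidence in record.predicted
        if confidence >= min_confidence  # assumed: higher means more certain
    }
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Made-up example record, not taken from the dataset.
example = DocRecord(
    pmcid="PMC0000000",
    gold=["Gene Expression", "protein folding"],
    predicted=[("gene expression", 0.9), ("cell", 0.4), ("protein  folding", 0.2)],
)
print(exact_match_scores(example))
print(exact_match_scores(example, min_confidence=0.5))
```

Raising min_confidence trades recall for precision, which is one simple
way to make use of the confidence values. Partial or fuzzy matches would
need a different comparison than the set intersection used here.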

In case you do not have time to read the full thesis, the procedure
is summarised in [4] and subsequent pages.
To properly understand how this dataset was produced, you will need to
read at least [5] or the evaluation chapter of the thesis.

Happy extracting.
Alex

P.S. There is also a dataset of qualitative evaluation results.
However, as it comprises keyphrases from user-specified content, I
suspect it is not useful to anyone else.

P.P.S. If you have questions, don't hesitate to give me a shout.

[1] http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf
[2] http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip
[3] ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz
[4] http://smile.deri.ie/projects/keyphrase-extraction
[5] http://smile.deri.ie/node/204

On Mon, Jan 17, 2011 at 8:29 AM, Sandra Garcia Blasco
<sgarcia at dsic.upv.es> wrote:
> Dear all,
>
> We are interested in evaluating our method for Keyword Extraction, but we are
> having a hard time finding a corpus to evaluate it on. Do any of you know of
> an available corpus of texts with related keywords?
>
> Thank you very much for your help,
>
>
> Sandra Garcia --
>
> Universitat Politécnica de Valencia
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
