[Corpora-List] datasets for automatic keyword/keyphrase extraction task

Alexander Schutz goalscoringsuperstarhero at gmail.com
Wed May 5 08:37:27 UTC 2010


Greetings,

a dataset resulting from my master's thesis, 'Keyphrase Extraction
from Single Documents in the Open Domain Exploiting Linguistic and
Statistical Methods' [1] is available at [2].

It was based on the PubMed dataset available for download [3], which
already contains keyphrases for documents.
My dataset basically contains a reference back to the original PubMed
article via pmcid, the originally assigned keyphrases (gold standard),
the keyphrases assigned by my approach including confidence, some
indications as to which sort of match between gold standard and
approach has occurred, and some document statistics. This is all on a
per-document basis, covering 1323 documents from the original PubMed
dataset (80k or so docs).
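
As a rough illustration of how such per-document records might be used, the sketch below scores extracted keyphrases against the gold standard for one document. The record layout, the field names (`pmcid`, `gold`, `extracted`) and the sample values are assumptions made for this example, not the actual format of the files in [2]. It covers only exact matches after simple normalisation, not the other sorts of match recorded in the dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DocRecord:
    """Hypothetical per-document record; field names are assumptions."""
    pmcid: str                            # reference back to the PubMed article
    gold: List[str]                       # originally assigned keyphrases
    extracted: List[Tuple[str, float]]    # (keyphrase, confidence) pairs


def normalise(phrase: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(phrase.lower().split())


def exact_match_scores(rec: DocRecord, min_conf: float = 0.0):
    """Return (precision, recall, F1) for exact matches on one document.

    Only extracted keyphrases with confidence >= min_conf are counted.
    """
    gold = {normalise(p) for p in rec.gold}
    pred = {normalise(p) for p, conf in rec.extracted if conf >= min_conf}
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example record, not taken from the dataset.
rec = DocRecord(
    pmcid="PMC0000000",
    gold=["Gene Expression", "apoptosis"],
    extracted=[("gene  expression", 0.9), ("cell death", 0.4), ("apoptosis", 0.2)],
)
print(exact_match_scores(rec))                # all extracted keyphrases
print(exact_match_scores(rec, min_conf=0.5))  # high-confidence ones only
```

Raising `min_conf` trades recall for precision, which is one way the per-keyphrase confidence could be put to use.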

For those who do not have time to read the full thesis, the procedure
is summarised in [4] and subsequent pages.
To properly understand how this dataset was produced, you should at
least read [5] or the evaluation chapter of the thesis.

Happy extracting.
Alex

P.S. There is also a dataset of qualitative evaluation results.
However, as it comprises keyphrases from user-specified content, I
suspect it is not useful to anyone else.

P.P.S. If you have questions, don't hesitate to give me a shout.

[1] http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf
[2] http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip
[3] ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz
[4] http://smile.deri.ie/projects/keyphrase-extraction
[5] http://smile.deri.ie/node/204


On Wed, May 5, 2010 at 2:46 AM, Su Nam Kim <sunamkim at gmail.com> wrote:
> Hello, all
> 4 datasets for automatic keyphrase extraction task are available at
> http://github.com/skrathnam/AutomaticKeyphraseExtraction
> If you have questions about datasets, please contact the data
> developers directly.
> Also, if you have a dataset to share, please, contact me to post.
> Thank you.
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
