[Corpora-List] corpus of abstracts/papers with free-form keywords

Thu Dec 2 16:09:15 UTC 2010

Mark,

a dataset resulting from my master's thesis, 'Keyphrase Extraction
from Single Documents in the Open Domain Exploiting Linguistic and
Statistical Methods' [1], is available at [2].

It may be related to what you want to investigate.

It was based on the PubMed dataset available for download [3], which
already contains keyphrases for documents.
My dataset basically contains a reference back to the original PubMed
article via pmcid, the originally assigned keyphrases (gold standard),
the keyphrases assigned by my approach including confidence, some
indications as to which sort of match between gold standard and
approach has occurred, and some document statistics. This is all on a
per-document basis, covering 1323 documents from the original PubMed
dataset (80k or so docs).
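
As a rough illustration of how such per-document records could be
scored, here is a minimal Python sketch of exact-match precision,
recall and F1 between gold-standard and extracted keyphrases. It is
not the evaluation code from the thesis: the record layout, the field
names (pmcid, gold, extracted), the helper names and the example
values are all assumptions made for the example, and the actual file
format in [2] may look quite different.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DocRecord:
    # Hypothetical record layout; the real field names in the dataset may differ.
    pmcid: str
    gold: List[str]                      # originally assigned keyphrases
    extracted: List[Tuple[str, float]]   # (keyphrase, confidence) pairs


def normalise(phrase: str) -> str:
    """Lower-case a phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


def exact_match_scores(doc: DocRecord, min_confidence: float = 0.0):
    """Return (precision, recall, f1) for one document.

    Only exact matches after normalisation count as hits; extracted
    keyphrases below min_confidence are ignored.
    """
    gold = {normalise(p) for p in doc.gold}
    predicted = {normalise(p) for p, conf in doc.extracted
                 if conf >= min_confidence}
    hits = gold & predicted
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example record, purely for illustration.
example = DocRecord(
    pmcid="PMC0000000",
    gold=["Gene Expression", "microarray"],
    extracted=[("gene  expression", 0.9), ("protein", 0.4), ("microarray", 0.2)],
)
print(exact_match_scores(example))
print(exact_match_scores(example, min_confidence=0.5))
```

Raising min_confidence trades recall for precision, which is one way
to make use of the per-keyphrase confidence values. Partial or fuzzy
matches would need a different comparison than the set intersection
used here.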

For those who do not have time to read the full thesis, the procedure
is summarised in [4] and subsequent pages.
To properly understand how this dataset was produced, you need to
read at least [5] or the evaluation chapter of the thesis.

Happy extracting.
Alex

P.S. There is also a dataset of qualitative evaluation results;
however, as it comprises keyphrases from user-specified content, I
suspect it is not useful to anyone else.

P.P.S. If you have questions, don't hesitate to give me a shout.

[1] http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf
[2] http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip
[3] ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz
[4] http://smile.deri.ie/projects/keyphrase-extraction
[5] http://smile.deri.ie/node/204

On Thu, Dec 2, 2010 at 2:12 PM, Mark Johnson <mark.johnson at mq.edu.au> wrote:
> I'm trying to evaluate unsupervised algorithms for identifying topical
> collocations in document collections.  One idea I've had is: if I had a
> corpus of abstracts or papers that have been manually labelled with
> free-form keywords, I could evaluate the degree to which the topical
> collocations match the human-annotated keywords.   Can anyone point me to a
> suitable corpus -- perhaps one that has already been used for this purpose?
>
> Thanks in advance,
>
> Mark Johnson
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
