[Corpora-List] A corpus to evaluate Keyword Extraction techniques

Chris Fournier cfour037 at uottawa.ca
Mon Jan 17 18:20:45 UTC 2011


Sandra,

Su Nam Kim has collected and hosts a number of popular keyphrase extraction
corpora from a variety of papers and theses on GitHub
<https://github.com/snkim/AutomaticKeyphraseExtraction>.

Chris


On Mon, Jan 17, 2011 at 9:26 AM, Diana Inkpen <diana at site.uottawa.ca> wrote:

>
>
> -------- Original Message --------
> Subject: Re: [Corpora-List] A corpus to evaluate Keyword Extraction techniques
> Date: Mon, 17 Jan 2011 10:29:59 +0000
> From: Alexander Schutz <goalscoringsuperstarhero at gmail.com>
> To: Sandra Garcia Blasco <sgarcia at dsic.upv.es>
> CC: corpora at uib.no
>
> Apologies,
>
> it appears the PubMed URL has changed; I should have checked before sending.
> Now, [1] includes a number of links to downloadable articles, in the section
> "XML for data mining via FTP".
>
> [1] http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html
>
> Kind regards,
> Alex
>
>
> On Mon, Jan 17, 2011 at 10:26 AM, Alexander Schutz
> <goalscoringsuperstarhero at gmail.com> wrote:
> > Sandra,
> >
> > a dataset resulting from my master's thesis, 'Keyphrase Extraction
> > from Single Documents in the Open Domain Exploiting Linguistic and
> > Statistical Methods' [1] is available at [2].
> >
> > It was based on the PubMed dataset available for download [3], which
> > already contains keyphrases for documents.
> > My dataset basically contains a reference back to the original PubMed
> > article via pmcid, the originally assigned keyphrases (gold standard),
> > the keyphrases assigned by my approach including confidence, some
> > indications as to which sort of match between gold standard and
> > approach has occurred, and some document statistics. This is all on a
> > per-document basis, covering 1323 documents from the original PubMed
> > dataset (80k or so docs).
> >
> > In case you do not have time to read the full thesis, the procedure
> > is summarised in [4] and subsequent pages.
> > To gain a proper understanding of how this dataset was produced, it is
> > at least necessary to read and understand [5], or the evaluation
> > chapter of the thesis.
> >
> > Happy extracting.
> > Alex
> >
> > P.S. There is also a dataset for qualitative evaluation results;
> > however, as this comprised keyphrases from user-specified content, I
> > suspect it is not useful for anyone else.
> >
> > P.P.S. If you have questions, don't hesitate to give me a shout.
> >
> > [1] http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf
> > [2] http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip
> > [3] ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz
> > [4] http://smile.deri.ie/projects/keyphrase-extraction
> > [5] http://smile.deri.ie/node/204
> >
> > On Mon, Jan 17, 2011 at 8:29 AM, Sandra Garcia Blasco
> > <sgarcia at dsic.upv.es> wrote:
> >> Dear all,
> >>
> >> We are interested in evaluating our method for Keyword Extraction, but we are
> >> having a hard time finding a corpus to evaluate it on. Do any of you know of
> >> an available corpus of texts with associated keywords?
> >>
> >> Thank you very much for your help,
> >>
> >>
> >> Sandra Garcia --
> >>
> >> Universitat Politécnica de Valencia
> >>
> >> _______________________________________________
> >> Corpora mailing list
> >> Corpora at uib.no
> >> http://mailman.uib.no/listinfo/corpora
> >>
> >>
> >
>