[Corpora-List] A corpus to evaluate Keyword Extraction techniques

Aurelie Neveol aneveol at gmail.com
Thu Jan 20 15:39:15 UTC 2011


Sandra,

Someone asked a similar question on the list last month, requesting pointers
to a "corpus of abstracts/papers with free-form
keywords<http://listserv.linguistlist.org/cgi-bin/wa?A2=ind1012&L=CORPORA&F=&S=&P=8462>".
The answers from the corpora archive might be helpful to you:

http://listserv.linguistlist.org/cgi-bin/wa?A1=ind1012&L=CORPORA#37

Best regards,

Aurelie


On Mon, Jan 17, 2011 at 1:20 PM, Chris Fournier <cfour037 at uottawa.ca> wrote:

> Sandra,
>
> Su Nam Kim has collected and hosts a number of popular keyphrase extraction
> corpora from a variety of papers and theses on GitHub
> <https://github.com/snkim/AutomaticKeyphraseExtraction>.
>
> Chris
>
>
> On Mon, Jan 17, 2011 at 9:26 AM, Diana Inkpen <diana at site.uottawa.ca> wrote:
>
>>
>>
>> -------- Original Message --------
>> Subject: Re: [Corpora-List] A corpus to evaluate Keyword Extraction techniques
>> Date: Mon, 17 Jan 2011 10:29:59 +0000
>> From: Alexander Schutz <goalscoringsuperstarhero at gmail.com>
>> To: Sandra Garcia Blasco <sgarcia at dsic.upv.es>
>> CC: corpora at uib.no
>>
>> Apologies,
>>
>> it appears the PubMed URL has changed; I should have checked before sending.
>> Now, [1] includes a number of links to downloadable articles, in the section
>> "XML for data mining via FTP".
>>
>> [1] http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html
>>
>> Kind regards,
>> Alex
>>
>>
>> On Mon, Jan 17, 2011 at 10:26 AM, Alexander Schutz
>> <goalscoringsuperstarhero at gmail.com> wrote:
>> > Sandra,
>> >
>> > a dataset resulting from my master's thesis, 'Keyphrase Extraction
>> > from Single Documents in the Open Domain Exploiting Linguistic and
>> > Statistical Methods' [1] is available at [2].
>> >
>> > It was based on the PubMed dataset available for download [3], which
>> > already contains keyphrases for documents.
>> > My dataset basically contains a reference back to the original PubMed
>> > article via pmcid, the originally assigned keyphrases (gold standard),
>> > the keyphrases assigned by my approach including confidence, some
>> > indications as to which sort of match between gold standard and
>> > approach has occurred, and some document statistics. This is all on a
>> > per-document basis, covering 1323 documents from the original PubMed
>> > dataset (80k or so docs).
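A per-document record of this kind (article id, gold-standard keyphrases, assigned keyphrases with confidence) lends itself to a simple exact-match evaluation. The following is only a minimal sketch: the field names, the `normalize` helper and the `min_confidence` threshold are assumptions made for illustration, not the actual layout of the dataset or the matching scheme used in the thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DocRecord:
    # Hypothetical field names; the real dataset may be laid out differently.
    pmcid: str
    gold: List[str]                      # originally assigned keyphrases
    predicted: List[Tuple[str, float]]   # (keyphrase, confidence) pairs


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(phrase.lower().split())


def score(record: DocRecord, min_confidence: float = 0.0) -> Tuple[float, float, float]:
    """Return exact-match (precision, recall, F1) for one document."""
    gold = {normalize(p) for p in record.gold}
    pred = {normalize(p) for p, conf in record.predicted if conf >= min_confidence}
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example record, purely for illustration.
    doc = DocRecord(
        pmcid="PMC0000000",
        gold=["Gene Expression", "protein folding"],
        predicted=[("gene  expression", 0.9), ("cell", 0.4), ("protein folding", 0.2)],
    )
    print(score(doc))
    print(score(doc, min_confidence=0.5))
```

Raising `min_confidence` trades recall for precision; averaging the per-document scores over all records would give a corpus-level figure.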
>> >
>> > In case you do not have time to read the full thesis, the procedure
>> > is summarised in [4] and subsequent pages.
>> > To gain a proper understanding of how this dataset was produced, it is
>> > necessary to read at least [5], or the evaluation chapter of the
>> > thesis.
>> >
>> > Happy extracting.
>> > Alex
>> >
>> > P.S. There is also a dataset for qualitative evaluation results,
>> > however as this comprised keyphrases from user-specified content, I
>> > suspect this is not useful for anyone else.
>> >
>> > P.P.S. If you have questions, don't hesitate to give me a shout.
>> >
>> > [1] http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf
>> > [2] http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip
>> > [3] ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz
>> > [4] http://smile.deri.ie/projects/keyphrase-extraction
>> > [5] http://smile.deri.ie/node/204
>> >
>> > On Mon, Jan 17, 2011 at 8:29 AM, Sandra Garcia Blasco
>> > <sgarcia at dsic.upv.es> wrote:
>> >> Dear all,
>> >>
>> >> We are interested in evaluating our method for Keyword Extraction, but we
>> >> are having a hard time finding a corpus to evaluate it on. Do any of you
>> >> know of an available corpus of texts with associated keywords?
>> >>
>> >> Thank you very much for your help,
>> >>
>> >>
>> >> Sandra Garcia --
>> >>
>> >> Universitat Politécnica de Valencia
>> >>
>> >> _______________________________________________
>> >> Corpora mailing list
>> >> Corpora at uib.no
>> >> http://mailman.uib.no/listinfo/corpora
>> >>
>> >>
>> >
>>
>>
>>
>
>
>


-- 
Aurélie Névéol, PhD

National Library of Medicine
Bldg. 38A, 10N-003B
9000 Rockville Pike
Bethesda, MD 20894
USA

Tel: (+1) (301) 435 9026