[Corpora-List] corpus of abstracts/papers with free-form keywords

Aurelie Neveol aneveol at gmail.com
Thu Dec 2 15:15:11 UTC 2010


Mark,

In the biomedical domain, the PubMed Central Open Access corpus has
about 14,000 full-text articles with free-text author keywords. A
recent study used a similarity measure to compare these author keywords
with the MeSH terms assigned by MEDLINE indexers:

Névéol A, Islamaj-Doğan R, Lu Z. Author Keywords in Biomedical Journal
Articles. Proc AMIA Annu Symp. 2010:537-41.
http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Neveol/NeveolAMIA10.pdf

Still in the scientific domain, but not restricted to biomedical
data, a similar corpus with author-assigned "key-phrases" was used in
the SEMEVAL task "Automatic Keyphrase Extraction from Scientific
Articles": http://semeval2.fbk.eu/semeval2.php?location=tasks#T6
The corpus comprises full-text articles and author-assigned free-text keywords.
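
For the kind of evaluation you describe, one simple starting point would
be exact-match precision, recall and F1 between the extracted
collocations and the author keywords of each document. The sketch below
is only an illustration under my own assumptions: the function names, the
normalization (lowercasing and whitespace collapsing) and the sample
phrases are hypothetical, and this is neither the similarity measure from
the paper above nor the official SEMEVAL scoring.

```python
def normalize(phrase):
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(phrase.lower().split())


def keyword_prf(predicted, gold):
    """Exact-match precision, recall and F1 between two lists of phrases."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if matched else 0.0
    return precision, recall, f1


# Hypothetical example data, not taken from either corpus.
gold = ["gene expression", "Text Mining", "MeSH indexing"]
predicted = ["text  mining", "gene expression", "open access"]
print(keyword_prf(predicted, gold))
```

Exact matching is strict, so stemming or partial-overlap credit may be
worth adding, depending on how closely the collocations are expected to
mirror the authors' wording.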

I hope this helps,

Aurelie

On Thu, Dec 2, 2010 at 9:12 AM, Mark Johnson <mark.johnson at mq.edu.au> wrote:
> I'm trying to evaluate unsupervised algorithms for identifying topical
> collocations in document collections.  One idea I've had is: if I had a
> corpus of abstracts or papers that have been manually labelled with
> free-form keywords, I could evaluate the degree to which the topical
> collocations match the human-annotated keywords.   Can anyone point me to a
> suitable corpus -- perhaps one that has already been used for this purpose?
>
> Thanks in advance,
>
> Mark Johnson
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Aurélie Névéol, PhD

National Library of Medicine
Bldg. 38A, 10N-003B
9000 Rockville Pike
Bethesda, MD 20894
USA

Tel: (+1) (301) 435 9026
