[Corpora-List] Summary: resources for evaluating term extraction

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Mon Feb 24 14:29:14 UTC 2014


Dear Johannes,
 
I believe that the JRC EuroVoc Indexer software JEX
<http://ipsc.jrc.ec.europa.eu/index.php?id=60> does more or less what you
are looking for. It indexes new documents according to EuroVoc, a
wide-coverage controlled-vocabulary thesaurus of over 6000
classes/categories/descriptors, by performing a profile-based category
ranking task: it first recognises the statistically salient words in
the new document and then compares them to the pre-generated category
profiles, each consisting of words and their weights. JEX is freely
downloadable from the page linked above, together with the document
collection it was trained on. You can also re-train JEX on your own
document collection. JEX has been pre-trained for 22 languages, on tens
of thousands of documents each. The software is described in the paper:
 
Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc Indexer
JEX - A freely available multi-label categorisation tool.
Proceedings of the 8th international conference on Language Resources and
Evaluation (LREC'2012), pp. 798-805, Istanbul, 21-27 May 2012. Available at
http://www.lrec-conf.org/proceedings/lrec2012/pdf/875_Paper.pdf,
 
where you will also find references to related work.
 
The 22 languages covered are: Bulgarian, Czech, Danish, Dutch, English,
Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian,
Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish
and Swedish. 
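 
To illustrate the profile-based category ranking described above, here is
a minimal Python sketch. It is not JEX's actual implementation: the
salience score (relative frequency weighted by the log of the inverse
background frequency), the cosine comparison, the toy profiles and
background frequencies, and all function names are assumptions made
purely for illustration.

```python
import math
from collections import Counter

# Toy category profiles (hypothetical): words with weights per descriptor.
PROFILES = {
    "fisheries policy": {"fishing": 0.9, "quota": 0.7, "vessel": 0.6},
    "air transport": {"airline": 0.9, "airport": 0.8, "passenger": 0.5},
}

# Toy background relative frequencies (hypothetical reference corpus).
BACKGROUND = {"the": 0.05, "of": 0.03, "and": 0.03, "for": 0.01, "new": 0.002}


def salient_words(text, background_freq, top_n=10):
    """Return the top_n words scored by tf * log(1 / background frequency)."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        # Words missing from the background get a small floor frequency
        # (an assumption), so rare words score as highly salient.
        bg = background_freq.get(word, 1e-6)
        scores[word] = (count / total) * math.log(1.0 / bg)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)


def cosine(a, b):
    """Cosine similarity between two sparse word->weight vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_categories(text, profiles, background_freq, top_k=3):
    """Rank categories by similarity of their profile to the document."""
    doc_vec = salient_words(text, background_freq)
    scored = sorted(
        ((cosine(doc_vec, profile), name) for name, profile in profiles.items()),
        reverse=True,
    )
    return [(name, score) for score, name in scored[:top_k]]


if __name__ == "__main__":
    doc = "The new fishing quota for each vessel and the fishing fleet"
    for name, score in rank_categories(doc, PROFILES, BACKGROUND):
        print(f"{name}\t{score:.3f}")
```

In this sketch the document shares "fishing", "quota" and "vessel" with
the first profile, so that category is ranked first, while the second
profile has no overlapping words and scores zero. A real system would use
profiles learned from many manually indexed documents rather than
hand-written ones.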
 
I hope this helps.
 
All the best,
 

Ralf
 
 
 
Ralf Steinberger (Ralf.Steinberger at jrc.ec.europa.eu) 
European Commission - Joint Research Centre (JRC)
IPSC - GlobeSec - OPTIMA (OPensource Text Information Mining and Analysis)
URL - Applications: http://emm.newsbrief.eu/overview.html
URL - The science behind them:  <http://langtech.jrc.it/>
http://langtech.jrc.ec.europa.eu
T.P. 267, Via Fermi 2749
21027 Ispra (VA), Italy
Tel: +39 0332 78-6271
Fax: +39 0332 78-5154
Secretary: +39 0332 78-5648 or 9478
 
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Johannes Hellrich
Sent: 24 February 2014 13:26
To: corpora at uib.no; adam at lexmasterclass.com
Subject: Re: [Corpora-List] Summary: resources for evaluating term
extraction
 
Dear all,

is anyone aware of a resource for grounded term extraction, i.e. linking the
newly extracted terms to an existing ontology/thesaurus/... ?
Thanks,

Johannes


> Message: 5
> Date: Mon, 24 Feb 2014 10:29:09 +0000
> From: Adam Kilgarriff <adam at lexmasterclass.com>
> Subject: Re: [Corpora-List] Summary: resources for evaluating term
>       extraction
> To: "corpora at hd.uib.no" <corpora at hd.uib.no>
>
> Apologies - I missed out
>
> 6.  María José Marín Pérez has created a corpus of legal English (BLARC)
> and has used it for extensive term-extraction
> experiments, and can provide both the corpus and the lists of terms (paper
> submitted to COLING)
>
> Adam
>
>
> On 24 February 2014 10:16, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
>> Dear all
>>
>> here is a summary of responses to my request for resources for evaluating
>> term extraction.
>>
>> 1. The TTC project has prepared corpora and terms for two domains and seven
>> languages: see
>>
>> http://www.lina.univ-nantes.fr/?Reference-Term-Lists-of-TTC.html
>>
>> Thanks to Anne Schumann
>>
>> 2. The ACL Anthology corpus has been marked up with "valid terms" and
>> "technology terms".
>> Thanks to Behrang Qasemizadeh
>>
>> 3. Georgeta Bordea says:
>> In our previous work [1], done in the context of the Saffron project [2],
>> we were interested in cross-domain evaluation of term extraction. Because
>> we did not find other datasets similar to GENIA, we relied on datasets
>> annotated for keyphrase extraction [3] and index term assignment [4],
>> which are more abundant.
>>
>> [1]
>> https://lipn.univ-paris13.fr/tia2013/Proceedings/actesTIA2013.pdf#page=59
>> [2] http://saffron.deri.ie/
>> [3] https://github.com/snkim/AutomaticKeyphraseExtraction
>> [4] http://code.google.com/p/maui-indexer/wiki/Resources
>>
>>
>> 4. Kevin Cohen and Sophia Ananiadou pointed to resources related to the
>> Termine tool; however, they did not include reference lists of 'gold
>> standard' terms.
>>
>> 5. Viktor Pekar pointed to a SemEval task which included "aspect term
>> extraction" in the domain of restaurant reviews, by which they mean
>> terms such as "service" and "staff" in the sentence "I liked the service
>> and the staff".
>> See http://alt.qcri.org/semeval2014/task4/
>> This wasn't quite what we were looking for.
>>
>> Thanks all
>>
>> Adam
>>
>> ========original post====================
>> Date: Wed, 19 Feb 2014 11:34:36 +0000
>> Subject: [Corpora-List] Resources for evaluating term extraction
>>
>> Dear all,
>>
>> The Sketch Engine now supports term extraction for many languages - and
>> we want to evaluate it.
>>
>> For that, we need domain corpora in which somebody has gone through
>> identifying all the 'true' terms.  Then we can compute our system's
>> precision and recall.
>>
>> We are aware of GENIA, for English, and are using that already (key
>> citation here: A comparative evaluation of term recognition
>> algorithms. 2008: Z Zhang, J Iria, CA Brewster, F Ciravegna)
>>
>> Any corpus with "the terms it contains", conscientiously produced, will
>> help us.
>>
>> Pointers please!
>>
>> Adam
>> --
>> ========================================
>> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
>> adam at lexmasterclass.com
>> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
>>
>> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
>>
>> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
>>
>> *DANTE: a lexical database for English* <http://www.webdante.com>
>> ========================================
>>
 
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

