[Corpora-List] Resources for evaluating term extraction

Thu Feb 20 15:46:21 UTC 2014

Dear Adam,

There is also a dataset of annotated terms extracted from the ACL ARC corpus [1]. It is not a multilingual dataset, but since the ACL ARC is available in the Sketch Engine [2], I thought you might find it helpful.

Candidate terms are extracted using a part-of-speech-based filtering method and are then ranked by several different scores, among them the C-value. The top 1000 extracted terms are then annotated.
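
For readers unfamiliar with the C-value score mentioned above, here is a minimal Python sketch of the basic formula as it is usually stated: a candidate's frequency, discounted by the average frequency of the longer candidates that contain it, and weighted by log2 of its length in words. The function names and the toy frequencies are illustrative assumptions, not the code or data used to build this dataset, and the sketch leaves out the part-of-speech filtering step.

```python
import math


def is_nested(short, long):
    """True if the token tuple `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Score candidates; `freq` maps a tuple of tokens to its corpus frequency."""
    scores = {}
    for term, f in freq.items():
        # Longer candidates that contain this term as a sub-sequence.
        containers = [t for t in freq
                      if len(t) > len(term) and is_nested(term, t)]
        weight = math.log2(len(term))  # 0 for single-word candidates
        if containers:
            avg_container_freq = sum(freq[t] for t in containers) / len(containers)
            scores[term] = weight * (f - avg_container_freq)
        else:
            scores[term] = weight * f
    return scores


# Toy frequencies, invented for illustration only.
freq = {
    ("language", "model"): 10,
    ("statistical", "language", "model"): 4,
    ("parse", "tree"): 6,
}
scores = c_value(freq)
ranked = sorted(scores, key=scores.get, reverse=True)
for term in ranked:
    print(" ".join(term), round(scores[term], 3))
```

Because log2(1) is 0, single-word candidates always score 0 in this form; the measure is aimed at multi-word terms.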

Terms are annotated at three different levels: invalid terms, valid terms, and technology terms. The technology terms are those valid terms that refer to a "technology concept" in the domain of computational linguistics. For example:

* Valid terms: natural language, parse tree, language model
* Technology terms: machine translation, natural language processing
* Invalid terms: sex and violence, dr. smith, intended referent, % error rate

The statistics for the dataset currently read:
Size: 61609 terms annotated out of 1.5m terms
Valid terms: 18186
  of which technology terms: 12389
Invalid terms: 43423
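
As a quick sanity check on these figures, the short Python snippet below confirms that the valid and invalid counts add up to the annotated total and derives the two proportions (about 29.5% of annotated terms are valid, and about 68.1% of valid terms are technology terms). The variable names are illustrative; the numbers are the ones quoted above.

```python
annotated = 61609   # terms annotated so far
valid = 18186       # valid terms (includes technology terms)
technology = 12389  # valid terms that are also technology terms
invalid = 43423     # invalid terms

# Valid and invalid terms should partition the annotated set.
assert valid + invalid == annotated

valid_share = valid / annotated
tech_share_of_valid = technology / valid

print(f"valid / annotated:  {valid_share:.1%}")
print(f"technology / valid: {tech_share_of_valid:.1%}")
```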

A small number of terms (only 300) were annotated several times by a small number of students; inter-annotator agreement is therefore also available from that data.
Please find attached a sample of the top 1000 terms, sorted in descending order of C-value score (annotatd_cvalue.cmpr_cvalue). If you would like to access the rest of the dataset, please let me know.
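
For the multiply annotated subset, one common way to quantify agreement between two annotators is Cohen's kappa. The sketch below is only an illustration of that measure: the choice of kappa, the function name and the toy labels are assumptions, and nothing here reflects the agreement actually observed on this dataset. With more than two annotators a measure such as Fleiss' kappa would be the usual alternative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["valid", "invalid", "technology", "invalid"]
annotator_2 = ["valid", "invalid", "valid", "invalid"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```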

Best regards,

Behrang

PS. Also attached is the unannotated list of terms, in case you want to compare your own annotation with the annotated file (un_annotated__term.cmpr_cvalue.txt).

[1] http://acl-arc.comp.nus.edu.sg/
[2] https://the.sketchengine.co.uk/bonito/run.cgi/first_form?corpname=preloaded/aclarc_1;

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Sophia
Sent: 19 February 2014 16:30
To: Kevin B.Cohen
Cc: corpora at hd.uib.no
Subject: Re: [Corpora-List] Resources for evaluating term extraction

TerMine (http://www.nactem.ac.uk/software/termine/) would give you candidate terms, but these would also have to be evaluated.
C-value, on which TerMine is based, has been implemented for several languages, e.g. Spanish, Japanese and Chinese.

Sophia

On 19 Feb 2014, at 16:00, Kevin B. Cohen wrote:

Hi, Adam,

I would recommend talking with Sophia Ananiadou, the creator of TerMine.
Kev

On Wed, Feb 19, 2014 at 4:34 AM, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
Dear all,

The Sketch Engine now supports term extraction for many languages - and we want to evaluate it.

For that, we need domain corpora in which somebody has gone through identifying all the 'true' terms.  Then we can compute our system's precision and recall.
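
A minimal Python sketch of that computation, assuming the system output and the gold-standard annotation are both available as sets of normalised term strings. The function name and the toy term sets are hypothetical and are not part of the Sketch Engine.

```python
def precision_recall(extracted, gold):
    """Precision and recall of an extracted term set against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold_terms = {"machine translation", "language model", "parse tree"}
system_terms = {"machine translation", "language model", "dr. smith", "error rate"}

precision, recall = precision_recall(system_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact string matching is the simplest choice here; in practice the two sets would need the same normalisation (case, lemmatisation) before comparison.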

We are aware of GENIA, for English, and are using that already (key citation: Z Zhang, J Iria, CA Brewster, F Ciravegna, 2008: A comparative evaluation of term recognition algorithms <http://scholar.google.co.uk/citations?view_op=view_citation&hl=en&user=VsRwsN8AAAAJ&citation_for_view=VsRwsN8AAAAJ:u5HHmVD_uO8C>)

Any corpus with "the terms it contains", conscientiously produced, will help us.

Pointers please!

Adam

--
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>                  adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk/>
Corpora for all with the Sketch Engine <http://www.sketchengine.co.uk/>
DANTE: a lexical database for English <http://www.webdante.com/>
========================================

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

--
Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Computational Bioscience Program,
U. Colorado School of Medicine
303-916-2417
http://compbio.ucdenver.edu/Hunter_lab/Cohen

----------
Professor Sophia Ananiadou, School of Computer Science,
Director, National Centre for Text Mining
Manchester Institute of Biotechnology
University of Manchester
131 Princess Street, M1 7DN
http://www.nactem.ac.uk/
sophia.ananiadou at manchester.ac.uk
http://www.nactem.ac.uk/staff/sophia.ananiadou/
tel: +44 (0)161 306 3092

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: annotatd_cvalue.cmpr_cvalue.txt
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20140220/95dd9fd6/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: un_annotated__term.cmpr_cvalue.txt
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20140220/95dd9fd6/attachment-0003.txt>