[Corpora-List] Question: Citing Linguistic Corpora

Khalid CHOUKRI choukri at elda.org
Mon Mar 18 10:38:04 UTC 2013


Dear Colleagues

I followed the discussion about citing language resources, and it reinforced
our view that we need to cite properly the resources our community uses.
I will not go through the importance of doing this here; I am sure we all agree
it is critical and urgent.

We were about to announce an initiative that responds to this need, and I am
happy to share our plans ahead of schedule.

Almost two years ago, at the FLaReNet meeting
(http://www.flarenet.eu/sites/default/files/S1_Choukri_Position_Paper.pdf), I
made a proposal on the assignment of persistent and unique identifiers to
language resources.
The idea was to go beyond the current references via URLs (or DOIs) to ensure
that we have a really permanent identifier that would cover all existing data
sets, including those not publicly available (or not available on the Internet).

The idea was discussed with the major data centers managed by or serving the
NLP community, in particular ELRA, LDC and Oriental-Cocosda, and with some
major representatives of our field: AFNLP, ISCA, EAMT, etc.

We reviewed all the possibilities and considered all the options (a paper about
this was published at IJCNLP 2011).
We looked at the current situation, in which ELRA, LDC and other data centers
assign identifiers to the resources they distribute.
For example, ELRA-W0021 refers to ICE-GB (the British English component of the
International Corpus of English), and LDC2011T13 refers to Chinese Gigaword
Fifth Edition.

These are different "local" identifiers assigned by data centers, so we looked
at the possibility of using global identifiers that go beyond data centers, such
as URI, URN, IAN (International Article Number), PMID (life science and
biomedical) and, last but not least, ISBN.
Each of these identifiers has its advantages and drawbacks. The most common and
attractive one, ISBN, is also closely related to "publishers", and in some
countries it implicitly refers to copyright law (with all its severe
constraints).

After all these discussions, we concluded that we need to adopt our own
identifier: an international number that represents language resources, which
we called the International Standard Language Resource Number (ISLRN).

We discussed in detail whether such an identifier should carry some semantics in
its composition (to indicate that it refers to a textual or spoken corpus, a
lexicon, an ontology, video recordings, etc.). We agreed that such definitions
are still controversial, and hence agreed to keep the ISLRN neutral while able
to represent all language resources with a 13-digit format:

                          ISLRN:XXX-XXX-XXX-XXX-X
examples:

772-814-696-901-0<http://www.islrn.org/resources/772-814-696-901-0/>
500-657-957-472-7<http://www.islrn.org/resources/500-657-957-472-7/>
473-117-867-197-2<http://www.islrn.org/resources/473-117-867-197-2/>
642-875-857-557-9<http://www.islrn.org/resources/642-875-857-557-9/>
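
As an illustration of the format above, here is a minimal Python sketch that
checks whether a string follows the XXX-XXX-XXX-XXX-X layout. It tests the
layout only: this message does not describe any check digit or allocation
rule, so none is verified. The function name parse_islrn and the optional
"ISLRN" prefix handling are assumptions made for this example, not part of
the proposal.

```python
import re

# Layout taken from the message: 13 digits grouped as XXX-XXX-XXX-XXX-X.
# Accepting an optional "ISLRN" prefix followed by ":" or whitespace is an
# assumption, based on how the identifier is written in this message.
ISLRN_LAYOUT = re.compile(r"(?:ISLRN[:\s]\s*)?(\d{3}-\d{3}-\d{3}-\d{3}-\d)")


def parse_islrn(text):
    """Return the 13 digits as a plain string, or None if the layout is wrong.

    Hypothetical helper: it validates the visible layout only and does not
    check that the number was ever assigned to a resource.
    """
    match = ISLRN_LAYOUT.fullmatch(text.strip())
    if match is None:
        return None
    return match.group(1).replace("-", "")


if __name__ == "__main__":
    for candidate in ["772-814-696-901-0", "ISLRN:500-657-957-472-7", "772-814-696"]:
        print(candidate, "->", parse_islrn(candidate))
```

Keeping the identifier free of semantics means a check like this never needs
to know what kind of resource the number refers to.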



ELDA, LDC, O-Cocosda and AFNLP (tbc, pending internal discussions) agreed to
run such a service on behalf of the community; other data centers will join once
the service is in operation. These organizations will constitute the executive
committee of the ISLRN service.

The idea is to set up a web portal, run and moderated by these organizations,
where each owner, developer or producer of a language resource can request,
free of charge, an ISLRN for their resource. Those who need to reference such a
resource should cite its ISLRN, as we have been doing with ISBNs since the
mid-sixties (see the (fake) examples below).


The executive committee will be overseen by a steering committee in which all
major organizations will be represented. We are about to issue invitations for
a first general meeting next fall.

The structure we have suggested is:

(Steering Committee)
       |
       |
       |
       v
(executive committee)
       |
       |
       |
       v
(ISLRN service team)


The details of the modus operandi are described in a short position paper,
and the detailed arguments for such an initiative have been published. Both
papers are available at:

The ISLRN proposal:
http://docs.islrn.org/Proposal-ISLRN-workingpaper-v04.pdf


The paper from IJCNLP 2011:
http://docs.islrn.org/ijcnlp2011-pid-v6.pdf


We will make an official announcement by the end of March 2013, and requests
for ISLRNs will start in April 2013.

All the best

Khalid CHOUKRI, ELRA on behalf of the ISLRN Executive Committee




PS1: The ISLRN will make things easy. Picking up on the previous emails:
whatever reference style is used, impact factors and citation indexes will be
computed on the basis of the ISLRN. The ISLRNs in the examples below are fake.


Davies, Mark. (2008-) The Corpus of Contemporary American English: 450
million words, 1990-present. Available online at
http://corpus.byu.edu/coca/. ISLRN 642-555-213-127-4

British National Corpus, Version 3 (BNC XML Edition). 2007. Distributed by
Oxford University Computing Services on behalf of the BNC Consortium.
http://www.natcorp.ox.ac.uk/. ISLRN 143-765-223-127-3

Thompson, P., Iqbal, S. A., McNaught, J. and Ananiadou, S. (2009). Construction
of an annotated corpus to support biomedical information extraction. In: BMC
Bioinformatics, 10:349. http://www.biomedcentral.com/1471-2105/10/349/.
ISLRN 545-321-981-654-1
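
To illustrate how citation counts could be derived from references like the
ones above, here is a minimal Python sketch that pulls ISLRNs out of reference
lists and counts how many papers cite each one. It is only one possible
approach under stated assumptions: the function name count_islrn_citations,
the regular expression, and the rule of counting a resource once per paper are
all choices made for this example, not part of the ISLRN proposal.

```python
import re
from collections import Counter

# Assumption: an ISLRN appears in a reference as "ISLRN", then ":" or
# whitespace, then 13 digits in the XXX-XXX-XXX-XXX-X layout.
ISLRN_IN_TEXT = re.compile(r"ISLRN[:\s]\s*(\d{3}-\d{3}-\d{3}-\d{3}-\d)(?!\d)")


def count_islrn_citations(reference_lists):
    """Count, for each ISLRN, the number of reference lists that mention it.

    reference_lists is an iterable of strings, one per citing paper.
    A resource is counted once per paper, even if it is listed twice.
    """
    counts = Counter()
    for references in reference_lists:
        counts.update(set(ISLRN_IN_TEXT.findall(references)))
    return counts


if __name__ == "__main__":
    # The ISLRNs below are the fake ones used in this message.
    papers = [
        "Davies, Mark. (2008-) ... ISLRN  642-555-213-127-4",
        "Thompson, P. et al. (2009) ... ISLRN:545-321-981-654-1\n"
        "Davies, Mark. (2008-) ... ISLRN 642-555-213-127-4",
    ]
    for islrn, n in sorted(count_islrn_citations(papers).items()):
        print(islrn, n)
```

Because the count keys on the identifier alone, it does not matter how the
rest of each reference is formatted.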


PS2: The idea of assigning ISLRNs to resources should not stop us from pushing
our colleagues to use them. We have noticed that even when a good reference
exists, many authors do not use it.
