Corpora: Summary: Need for texts to evaluate named entity recognition software in En, Fr, De and Es

Wed Mar 27 18:34:43 UTC 2002

Thanks to Hamish Cunningham, Antonio S. Valderrábanos, Gabriel Pereira
Lopes, Fabio Ciravegna, Milena Slavcheva and Valerie Mapelli for replying to
our query, which is repeated at the end of this message.

In spite of receiving several replies to our query, we did not succeed in
getting a collection of texts containing named entities that could be used
for the evaluation of named entity recognition software. Most replies
pointed to general purpose collections of parallel text, but we needed a
specific test set of preferably short texts containing many named entities.

In the end, we compiled a test collection of 19 short parallel documents
ourselves. We identified strings that occur identically in English texts
and their Spanish translations, hand-picked those strings that were likely
to be named entities, and then chose the short texts that contained many of
these hand-picked strings. This small test corpus is available on request.
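
For readers who want to build a similar collection, the selection procedure
described above could be sketched roughly as follows. This is only an
illustration in Python, not the scripts we actually used: the function names
(shared_strings, rank_documents), the tokenisation pattern, the minimum
string length and the density score are all assumptions made for this
example. The hand-picking step stays manual; the code only proposes
candidate strings and ranks documents once a set of strings has been
approved by hand.

```python
import re

# Assumed tokenisation: runs of Unicode letters, optionally joined by
# a hyphen or apostrophe (e.g. "Jean-Pierre", "O'Neill").
TOKEN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def shared_strings(text_a, text_b, min_len=3):
    """Return the word strings that occur identically in both texts.

    Names are often left untranslated, so strings shared by a text and
    its translation are candidates for named entities. The candidates
    still have to be checked by hand.
    """
    words_a = {w for w in TOKEN.findall(text_a) if len(w) >= min_len}
    words_b = {w for w in TOKEN.findall(text_b) if len(w) >= min_len}
    return words_a & words_b


def rank_documents(pairs, approved):
    """Rank documents by the share of tokens that are approved strings.

    pairs:    dict mapping a document id to (english_text, spanish_text)
    approved: set of hand-picked strings judged to be named entities
    Returns a list of (doc_id, density), highest density first.
    """
    scored = []
    for doc_id, (english, _spanish) in pairs.items():
        tokens = TOKEN.findall(english)
        hits = sum(1 for t in tokens if t in approved)
        density = hits / len(tokens) if tokens else 0.0
        scored.append((doc_id, density))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A typical use would be to call shared_strings on each document pair, review
the union of the results by hand, and pass the approved strings to
rank_documents to find the short texts that are richest in names.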

Below, you will find the main answers:

----------------------------------
Antonio S. Valderrábanos  wrote:

You could have a look at the following pages:

MLCC corpora (second part). A parallel collection in nine EC languages.
Description at ELDA: http://www.elda.fr/cata/text/W0023.html

ECI Corpus. Contains different multilingual text collections; some of them
are parallel and may contain your language combinations. Description at
ELDA: http://www.elda.fr/cata/text/W0004.html You may want to find a more
detailed description than the one at ELDA.

CRATER. A good one, but it doesn't contain German.
http://www.elda.fr/cata/text/W0003.html

Besides, the UN site (www.un.org) contains large amounts of parallel texts
(although not in German).

----------------------------------
Gabriel Pereira Lopes wrote:

Just pick up texts from European legislation in force:
http://europa.eu.int/eur-lex/

----------------------------------
Fabio Ciravegna wrote:

Johannes Matiasek at OFAI, Vienna, (john at ai.univie.ac.at) developed a named
entity recogniser for German as part of the Facile project. He annotated a
corpus with named entities. Maybe you can contact him and ask for the
corpus. He is a nice guy.
----------------------------------
Hamish Cunningham wrote:

We have work just starting on NE in French and German, and a system that
currently runs in English, Bulgarian and Romanian. Apart from that I don't
know of anything other than for English (but I think there must be some
stuff out there...)

----------------------------------
Milena Slavcheva wrote:

Below you can find information about a parallel German-French corpus. The
style of the texts is, broadly speaking, administrative and you can find
plenty of named entities. From the message below (that I forward to you), I
can see that the corpus is already distributed by ELRA. For more
information, you can also contact Prof. Wolfgang Teubert
(teubertw at hhs.bham.ac.uk), who was the leader of the project producing
GeFRePaC at the Institut fuer deutsche Sprache (IDS) in Mannheim.

----------------------------------
Valerie Mapelli wrote:

I would like to refer you to the ELDA catalogue, where you may find corpora
of interest. Our web site: http://www.elda.fr Please do not hesitate to
contact me for any further information.

 -----Original Message-----
From: 	Ralf Steinberger [mailto:ralf.steinberger at jrc.it]
Sent:	Monday, March 18, 2002 5:15 PM
To:	CORPORA at HD.UIB.NO
Cc:	'Ralf STEINBERGER (JRC)'
Subject:	Need for texts to evaluate named entity recognition software in En,
Fr, De and Es

Hello,

we are looking for texts containing many named entities such as people's
names, company names, names of organisations/authorities and geographical
places in the languages English, French, German and Spanish.

The texts will be used for the evaluation of named entity recognition
software. Parallel texts (texts and their translations) would be preferred
as they would make the evaluation easier. It is not strictly necessary that
the named entities be marked up in the text.

The evaluation will be carried out by a student, who is writing her Master's
thesis on this subject, in collaboration with the EC's Joint Research
Centre. The thesis will be made publicly available.

Any hints are welcome. Thanks in advance.

Ralf Steinberger (ralf.steinberger at jrc.it)
European Commission, Joint Research Centre (http://www.jrc.it/langtech/)
Institute for the Protection and Security of the Citizen (IPSC)