Resource: JRC-Names, a freely available, highly multilingual named entity resource

Sat Sep 10 08:32:37 UTC 2011

Date: Fri, 09 Sep 2011 00:48:19 +0200
From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
Message-id: <012301cc6e79$67ef5ca0$37ce15e0$%Steinberger at jrc.ec.europa.eu>
X-url: http://emm.newsexplorer.eu/NewsExplorer/entities/en/262.html
X-url: http://emm.newsbrief.eu/overview.html

Readers of this list may be interested in the availability of this new
named entity resource.

WHAT IS JRC-NAMES

JRC-Names is a highly multilingual named entity resource for person and
organisation names ('entities'). It consists of large lists of names and
their many spelling variants (up to hundreds for a single person),
including across scripts (Latin, Greek, Arabic, Cyrillic, Japanese,
Chinese, etc.).  The named entity resource file with the list of
spelling variants is accompanied by demonstrator software, implemented
in Java, that (a) produces a list of known spelling variants for any
input name, and (b) analyses UTF-8-encoded text files to find known
entity mentions, returning the name variant found, the preferred display
name for that entity, the unique identifier for that entity, the
position of the name in the text, and its length in characters.
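
A minimal sketch of the two functions described above, written in Python
rather than Java. The in-memory table, the sample spellings, and the
names `variants_of`, `find_mentions` and `Mention` are illustrative
assumptions; they do not reproduce the file format or the API of the
actual JRC-Names release.

```python
import re
from typing import List, NamedTuple

# Hypothetical sample data: entity id -> (preferred display name, variants).
# The real resource file has its own format, which is not reproduced here.
ENTITIES = {
    262: ("Muammar Gaddafi",
          ["Muammar Gaddafi", "Moammar Gadhafi", "Муаммар Каддафи"]),
}


class Mention(NamedTuple):
    variant: str        # name variant as it appears in the text
    display_name: str   # preferred display name of the entity
    entity_id: int      # unique identifier of the entity
    position: int       # character offset of the match in the text
    length: int         # length of the match in characters


# Reverse index: lower-cased variant -> entity id.
VARIANT_TO_ID = {
    variant.lower(): entity_id
    for entity_id, (_, variants) in ENTITIES.items()
    for variant in variants
}

# One alternation over all variants, longest first so that a longer name
# is preferred over a shorter name that is its prefix. The lookarounds
# prevent matches inside longer words.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(v) for v in sorted(VARIANT_TO_ID, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,
)


def variants_of(name: str) -> List[str]:
    """Return all known spelling variants for a name, or [] if unknown."""
    entity_id = VARIANT_TO_ID.get(name.lower())
    return list(ENTITIES[entity_id][1]) if entity_id is not None else []


def find_mentions(text: str) -> List[Mention]:
    """Find known entity mentions in a text, in order of appearance."""
    mentions = []
    for match in _PATTERN.finditer(text):
        found = match.group(0)
        entity_id = VARIANT_TO_ID[found.lower()]
        mentions.append(
            Mention(found, ENTITIES[entity_id][0], entity_id,
                    match.start(), len(found))
        )
    return mentions
```

With this sample table, `find_mentions("Reports said Moammar Gadhafi
spoke.")` returns a single mention with the display name "Muammar
Gaddafi", position 13 and length 15.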

AN EXAMPLE

For an example, visit any of the more than one million entity pages on
EMM-NewsExplorer (e.g. that for Muammar Gaddafi at
http://emm.newsexplorer.eu/NewsExplorer/entities/en/262.html), which
show the list of spelling variants automatically collected for that
entity.

MOTIVATION FOR THIS RELEASE

The data release by the European Commission's Joint Research Centre
(JRC) is in line with the general effort of the European Commission to
support multilingualism, language diversity and the re-use of Commission
information. It follows the release of the JRC-Acquis parallel corpus in
2006 (over 1 billion words in 22 languages), of the DGT-TM Translation
Memory in 2007 (up to 2 million translation units per language in 22
languages), and further smaller multilingual resources.

WHAT JRC-NAMES CAN BE USED FOR

JRC-Names is a technical resource that can be used to find names even if
they are spelled differently and to normalise name spellings in
databases or other repositories. It is also a useful ingredient for IT
systems that process text, e.g. for text mining, machine translation,
social network generation, and other applications involving named
entities.
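
As an illustration of the normalisation use case, the following Python
sketch maps differently spelled names in a set of records to one
preferred spelling and groups them by entity. The lookup table and the
function names `normalise_name` and `group_by_entity` are hypothetical;
in practice the table would be built from the resource file.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup derived from the resource:
# lower-cased variant spelling -> (entity id, preferred display name).
VARIANT_LOOKUP: Dict[str, Tuple[int, str]] = {
    "muammar gaddafi": (262, "Muammar Gaddafi"),
    "moammar gadhafi": (262, "Muammar Gaddafi"),
    "muammar al-gaddafi": (262, "Muammar Gaddafi"),
}


def normalise_name(raw: str) -> Tuple[Optional[int], str]:
    """Return (entity id, preferred name); unknown names pass through."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    return VARIANT_LOOKUP.get(key, (None, raw))


def group_by_entity(names: List[str]) -> Dict[Optional[int], List[str]]:
    """Group raw name strings by entity id (None for unknown names)."""
    groups: Dict[Optional[int], List[str]] = {}
    for name in names:
        entity_id, _ = normalise_name(name)
        groups.setdefault(entity_id, []).append(name)
    return groups
```

Names that are not in the table are returned unchanged with an id of
`None`, so a database column can be normalised without losing entries
that the resource does not cover.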

HOW JRC-NAMES WAS PRODUCED

JRC-Names is a by-product of the analysis of about 100,000 news reports
per day by the Europe Media Monitor (EMM) family of applications (freely
accessible at http://emm.newsbrief.eu/overview.html).

It was mostly compiled automatically, by analysing hundreds of millions
of news articles since 2004 in up to twenty languages, identifying names
of entities (mostly persons, but also organisations, event names, and
more), and detecting which of these newly found names are variant
spellings of each other. Most name variants in JRC-Names are thus
spellings that were found in real-life text (including frequent spelling
mistakes).  In addition, for a subset of the entities, software
automatically extracted spelling variants in many further languages
(e.g. Chinese, Thai, Japanese) from the cross-lingual links in
Wikipedia. For highly frequent or otherwise important names, the entries
were also verified manually. As JRC-Names was mostly produced
automatically, it will contain some errors.
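
This announcement does not describe how variant spellings are detected.
Purely as an illustration of the general idea, the Python sketch below
uses one generic approach: simplify both names (lower-case, strip
diacritics) and compare them with a string similarity measure against a
threshold. The function names and the threshold of 0.8 are assumptions,
and this is not the JRC's method; in particular, it does not handle
names written in different scripts.

```python
import unicodedata
from difflib import SequenceMatcher


def simplify(name: str) -> str:
    """Lower-case a name, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the simplified forms of two names."""
    return SequenceMatcher(None, simplify(a), simplify(b)).ratio()


def likely_variants(a: str, b: str, threshold: float = 0.8) -> bool:
    """Guess whether two names are spelling variants of each other."""
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return similarity(a, b) >= threshold
```

Under these assumptions, "Muammar Gaddafi" and "Moammar Gadhafi" are
judged likely variants, while "Muammar Gaddafi" and "Angela Merkel" are
not. A purely string-based test like this also produces false matches
for distinct people with similar names, which is one reason why
automatically compiled lists contain errors.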

MORE INFORMATION ON JRC-NAMES 

At http://langtech.jrc.ec.europa.eu/, you will find more information on
the JRC's multilingual language technology activity, a download link for
JRC-Names, a reference paper explaining the named entity resource, and a
page pointing to other multilingual resources.

WHAT NEXT?

The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near future.

Ralf Steinberger  
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL - Applications: http://emm.newsbrief.eu/overview.html
URL - The science behind them: http://langtech.jrc.ec.europa.eu/ 

-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription : http://www.atala.org/article.php3?id_article=48
English version       : 
Archives                 : http://listserv.linguistlist.org/archives/ln.html
                                http://liste.cines.fr/info/ln

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership : http://www.atala.org/
-------------------------------------------------------------------------