[Corpora-List] JRC-Names - A freely available, highly multilingual named entity resource

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Thu Sep 8 22:45:20 UTC 2011


Readers of this list may be interested in the availability of this new named
entity resource.
 
WHAT IS JRC-NAMES
 
JRC-Names is a highly multilingual named entity resource for person and
organisation names ('entities'). It consists of large lists of names and
their many spelling variants (up to hundreds for a single person), including
across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.).
The named entity resource file with the list of spelling variants is
accompanied by Java-implemented demonstrator software that (a) produces,
for any input name, a list of known spelling variants, and (b) analyses
UTF-8-encoded text files to find known entity mentions, returning the name
variant found, the preferred display name for that entity, the unique name
identifier, the position of the entity name in the text, and its length in
characters.
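
To make the five returned fields concrete, here is a minimal sketch of such a
lookup. It is not the Java demonstrator and does not read the real resource
file: the in-memory tables, the function name find_mentions and the identifier
262 (borrowed from the NewsExplorer URL purely for illustration) are all
assumptions.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    variant: str       # spelling variant found in the text
    display_name: str  # preferred display name for the entity
    entity_id: int     # unique identifier shared by all variants
    position: int      # character offset of the mention in the text
    length: int        # length of the mention in characters


# Hypothetical in-memory excerpt; the real resource file is far larger
# and its on-disk format is not assumed here.
VARIANTS = {
    "Muammar Gaddafi": 262,
    "Moammar Gadhafi": 262,
    "Муаммар Каддафи": 262,
}
DISPLAY_NAMES = {262: "Muammar Gaddafi"}


def find_mentions(text: str) -> List[Mention]:
    """Return every exact occurrence of a known variant, ordered by position."""
    mentions = []
    for variant, entity_id in VARIANTS.items():
        start = text.find(variant)
        while start != -1:
            mentions.append(
                Mention(variant, DISPLAY_NAMES[entity_id], entity_id,
                        start, len(variant))
            )
            start = text.find(variant, start + 1)
    return sorted(mentions, key=lambda m: m.position)


if __name__ == "__main__":
    sample = "Moammar Gadhafi spoke. Муаммар Каддафи ответил."
    for mention in find_mentions(sample):
        print(mention)
```

A plain substring scan is enough for a handful of variants; for the full
resource a trie or Aho-Corasick matcher would be the more usual choice.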
 
AN EXAMPLE
 
For examples, go to any of the more than one million entity pages on
EMM-NewsExplorer (e.g. that for Muammar Gaddafi at
http://emm.newsexplorer.eu/NewsExplorer/entities/en/262.html), each of which
shows the list of spelling variants automatically collected for that entity.
 
MOTIVATION FOR THIS RELEASE
 
The data release by the European Commission's Joint Research Centre (JRC) is
in line with the general effort of the European Commission to support
multilingualism, language diversity and the re-use of Commission
information. It follows the release of the JRC-Acquis parallel corpus in
2006 (over 1 billion words in 22 languages), of the DGT-TM Translation
Memory in 2007 (up to 2 million translation units per language in 22
languages), and further smaller multilingual resources. 
 
WHAT JRC-NAMES CAN BE USED FOR
 

JRC-Names is a technical resource that can be used to find names even if
they are spelled differently and to normalise name spellings in databases or
other repositories. It is also a useful ingredient for IT systems that
process text, e.g. for text mining, machine translation, social network
generation, and other applications involving named entities.
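
As an illustration of the normalisation use case, the following minimal sketch
maps each name in a list of records to a preferred spelling. The lookup table,
the function name normalise_name and the whitespace and case folding are
assumptions made for this example, not part of the released software.

```python
# Hypothetical variant -> preferred spelling table; in practice it would be
# built from the JRC-Names resource file.
PREFERRED = {
    "Muammar Gaddafi": "Muammar Gaddafi",
    "Moammar Gadhafi": "Muammar Gaddafi",
    "Муаммар Каддафи": "Muammar Gaddafi",
}


def _key(name: str) -> str:
    """Collapse runs of whitespace and ignore case (a simplifying assumption)."""
    return " ".join(name.split()).casefold()


_LOOKUP = {_key(variant): preferred for variant, preferred in PREFERRED.items()}


def normalise_name(name: str) -> str:
    """Return the preferred spelling if the name is known, else the input unchanged."""
    return _LOOKUP.get(_key(name), name)


if __name__ == "__main__":
    records = ["moammar  gadhafi", "Муаммар Каддафи", "Unknown Person"]
    print([normalise_name(record) for record in records])
```

Unknown names are passed through unchanged so that the step can be applied to
a whole column without losing data.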
 
HOW JRC-NAMES WAS PRODUCED
 
JRC-Names is a by-product of the analysis of about 100,000 news reports per
day by the Europe Media Monitor (EMM) family of applications (freely
accessible at http://emm.newsbrief.eu/overview.html). 
 
It was mostly compiled automatically, by analysing hundreds of millions of
news articles since the year 2004 in up to twenty languages, identifying
names of entities (mostly persons, but also organisations, event names, and
more), and detecting which of these newly found names are variant spellings
of each other. Most name variants in JRC-Names are thus spellings that were
found in real-life text (including frequent spelling mistakes).
In addition, for a subset of the collection of entities, software
automatically extracted spelling variants in many further languages (e.g.
Chinese, Thai, Japanese, ...) from the cross-lingual links in Wikipedia.
Highly frequent or otherwise important names were also verified manually.
As JRC-Names was mostly produced automatically, it will contain some errors.
 
MORE INFORMATION ON JRC-NAMES 
 
At http://langtech.jrc.ec.europa.eu/, you will find more information on the
JRC's multilingual language technology activity, a download link for
JRC-Names and a reference paper explaining the named entity resource, as
well as a page pointing to other multilingual resources.
 
WHAT NEXT?
 
The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near future. 
 
 
Ralf Steinberger  
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL - Applications: http://emm.newsbrief.eu/overview.html
URL - The science behind them: http://langtech.jrc.ec.europa.eu/ 



 