27.3570, FYI: Release of JRC-Names: Multilingual Name Resource

Mon Sep 12 01:28:03 UTC 2016

LINGUIST List: Vol-27-3570. Sun Sep 11 2016. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry,
                                   Robert Coté, Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Sun, 11 Sep 2016 21:26:39
From: Guillaume Jacquet [guillaume.jacquet at jrc.ec.europa.eu]
Subject: Release of JRC-Names: Multilingual Name Resource

Dear all,

We are pleased to announce a new release of the JRC-Names multilingual name
resource, containing more information and now available as Linked Data.

JRC-Names is a highly multilingual named entity resource for persons and
organisations (called 'entities'), developed by the European Commission’s
Joint Research Centre (JRC). It consists of large lists of names and their
many spelling variants (up to hundreds for a single person), including
variants across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese,
etc.). For
example, the spellings Jean-Claude Juncker, Jean Cloud Junker, Jean-Claude
Juencker, Жан-Клод Юнкер, جان كلود جونكر, Ζαν Κλοντ Γιούνκερ, 让-克洛德•容克, and
many others have all been identified as referring to the 12th President of the
European Commission.

The resource is a by-product of the Europe Media Monitor (EMM) family of
applications, which has been analysing up to 300,000 news reports per day
since 2004. EMM recognises names mentioned in the news in over twenty
languages and decides automatically for each newly found name whether it
belongs to a new entity or whether it is a spelling variant of a previously
known entity. This resource allows EMM users to display news about people or
organisations even if their names are spelt differently or if the news
articles are written in different languages and scripts.
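
As a rough illustration of the "new entity or spelling variant" decision
described above, the following Python sketch normalises names and compares
them by string similarity. It is not EMM's actual method, which this
announcement does not describe: the function names, the use of difflib as the
similarity measure and the 0.85 threshold are all assumptions made for the
example, and it only handles variants written in the same script.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lowercase, strip diacritics, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.replace("-", " ").lower().split())


def assign_entity(name, known, threshold=0.85):
    """Return the id of the closest known entity, or None if the name looks new.

    `known` maps a (hypothetical) entity id to a list of known spellings.
    The threshold is an illustrative value, not one taken from EMM.
    """
    target = normalise(name)
    best_id, best_score = None, 0.0
    for entity_id, variants in known.items():
        for variant in variants:
            score = SequenceMatcher(None, target, normalise(variant)).ratio()
            if score > best_score:
                best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


if __name__ == "__main__":
    known = {"E1": ["Jean-Claude Juncker"], "E2": ["Angela Merkel"]}
    print(assign_entity("Jean-Claude Juencker", known))
    print(assign_entity("Barack Obama", known))
```

Matching across scripts (for example Latin against Cyrillic) would need a
transliteration step before comparison, which this sketch leaves out.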

JRC-Names has been available for download since September 2011 as name variant
lists with accompanying software (JRC-Names text version:
https://ec.europa.eu/jrc/en/language-technologies/jrc-names).

The new Linked Data resource
(https://data.europa.eu/euodp/en/data/dataset/jrc-names), accessible through
the European Union’s Open Data Portal (http://data.europa.eu/euodp/en/data),
offers more information than the previously released resource and tool,
including:

    - Titles and function names that have historically been found next to
      person mentions
    - Information about the time period during which name variants and their
      titles were found
    - Various frequency counts
    - Links to other linked datasets such as DBpedia, New York Times Open Data
      and Talk of Europe.

The JRC-Names RDF representation is based on lemon (Lexicon Model for
Ontologies), a model developed by the W3C Ontology-Lexica Community Group
which allows the expression of lexical information relative to ontologies. A
detailed description of the JRC-Names Linked Data representation is given in
the reference paper mentioned below.

Example uses of the resource include:

    - Entity linking, e.g. to deal with entity surface form variations;
    - Cross-lingual linked data-set query and mapping;
    - Search query expansion;
    - Machine translation;
    - Learning of transliteration rules;
    - Named entity recognition and disambiguation;
    - Cross-lingual document clustering.
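
To make one of these uses concrete, here is a minimal Python sketch of search
query expansion with a name variant list. The in-memory dictionary, the entity
id "E1" and the function name are hypothetical; the actual JRC-Names download
format is not described in this announcement, and the three spellings are
taken from the Juncker example above.

```python
def expand_query(query, variants_by_entity):
    """Expand a name query into an OR over all known spellings of that entity.

    `variants_by_entity` maps a (hypothetical) entity id to its spellings.
    If the query matches no known spelling, it is returned as a single phrase.
    """
    for variants in variants_by_entity.values():
        if query in variants:
            return " OR ".join(f'"{variant}"' for variant in variants)
    return f'"{query}"'


if __name__ == "__main__":
    variants_by_entity = {
        "E1": ["Jean-Claude Juncker", "Jean Cloud Junker", "Жан-Клод Юнкер"],
    }
    print(expand_query("Jean Cloud Junker", variants_by_entity))
```

A search for any one spelling then also retrieves documents that use the
other spellings, including those in other scripts.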

This new Linked Data edition is available through a SPARQL endpoint
(https://data.europa.eu/euodp/en/data/dataset/jrc-names/resource/da30b11d-a07e-45dd-bdb6-5f2ba5835d27)
and as an RDF dump
(http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/EMM/JRC-Names/LATEST/jrcnames_uri.zip).
It is registered on the datahub.io portal as JRC-Names
(https://datahub.io/dataset/jrc-names-ec). Additional information is available
on the EU Open Data Portal page for the dataset
(http://data.europa.eu/euodp/en/data/dataset/jrc-names).

Examples of queries against the dataset include:

    - Given a person's name, retrieve all of its name variants
    - Given a person's name, retrieve all of its name variants in a certain
      language
    - Given a person's name, retrieve all of its titles/function names in a
      certain language
    - Given a variant and a language, retrieve the corresponding entity
    - Given a title and a language, retrieve all persons with that title.
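
The first two queries above could be sketched roughly as follows. This Python
helper only builds a SPARQL query string and does not contact any endpoint.
It assumes that name variants are stored as lemon forms (lemon:canonicalForm
and lemon:otherForm with lemon:writtenRep); that modelling is a guess based on
the generic lemon vocabulary, so please check the reference paper for the
actual JRC-Names schema. The function names are hypothetical.

```python
# Namespace of the lemon model; its use by JRC-Names in exactly this way
# is an assumption made for illustration.
LEMON = "http://lemon-model.net/lemon#"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def variants_query(name, lang=None):
    """Build a query for all written forms sharing a lexical entry with `name`.

    If `lang` is given, only variants with that language tag are kept.
    """
    lang_filter = ""
    if lang:
        lang_filter = f'  FILTER (lang(?variant) = "{escape_literal(lang)}")\n'
    return (
        f"PREFIX lemon: <{LEMON}>\n"
        "SELECT DISTINCT ?variant WHERE {\n"
        "  ?entry (lemon:canonicalForm|lemon:otherForm)/lemon:writtenRep ?known ;\n"
        "         (lemon:canonicalForm|lemon:otherForm)/lemon:writtenRep ?variant .\n"
        f'  FILTER (str(?known) = "{escape_literal(name)}")\n'
        f"{lang_filter}"
        "}"
    )


if __name__ == "__main__":
    print(variants_query("Jean-Claude Juncker", lang="el"))
```

The resulting string can be pasted into the SPARQL endpoint's query form or
sent with any SPARQL client.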

Reference Paper:

JRC-Names: Multilingual Entity Name variants and titles as Linked Data,
Semantic Web Journal
(http://www.semantic-web-journal.net/system/files/swj1307.pdf)
Maud Ehrmann, Guillaume Jacquet and Ralf Steinberger (to appear, available
online since 04/20/2016)

Guillaume Jacquet, Maud Ehrmann, Ralf Steinberger
European Commission
Joint Research Centre
Text and Data Mining Unit
https://ec.europa.eu/jrc/en/language-technologies

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
