Soft: JEX - A freely available multi-label categorisation tool trained for 22 languages

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Thu May 17 15:45:03 UTC 2012

Date: Wed, 16 May 2012 12:57:23 +0200
From: Ralf Steinberger <ralf.steinberger at>
Message-id: <06c801cd3352$ad0c1f00$07245d00$>


The JRC EuroVoc Indexer JEX is ready-trained multi-label categorisation
software that assigns categories from the large-scale, wide-coverage
EuroVoc Thesaurus (consisting of thousands of categories). JEX is
distributed together with its training data (twenty to forty thousand
documents per language). JEX has been trained for 22 languages on mostly
parallel text (texts and their professionally produced translations).
You can re-train JEX with your own documents, and even with your own
categorisation scheme. JEX provides a graphical user interface (GUI), a
command line option for batch processing, and an API.
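
To illustrate what multi-label categorisation means in practice, here is
a minimal Python sketch. It is not JEX's algorithm or its API: the
category names, the term weights and the function rank_categories are
all hypothetical and chosen only for illustration. The sketch scores a
text against a small weighted term profile per category and returns the
best-scoring categories, so one document can receive several labels.

```python
from collections import Counter

# Hypothetical category profiles (category -> {term: weight}).
# These names and weights are illustrative, not taken from JEX.
PROFILES = {
    "fisheries policy": {"fish": 2.0, "quota": 1.5, "vessel": 1.0},
    "air transport": {"airline": 2.0, "airport": 1.5, "passenger": 1.0},
    "environmental protection": {"pollution": 2.0, "emission": 1.5, "fish": 0.5},
}


def rank_categories(text, profiles=PROFILES, top_n=2):
    """Return up to top_n (category, score) pairs, best score first."""
    # Naive whitespace tokenisation; a real system would do much more.
    counts = Counter(text.lower().split())
    scores = {}
    for category, profile in profiles.items():
        # Weighted sum of the profile terms that occur in the text.
        score = sum(weight * counts[term] for term, weight in profile.items())
        if score > 0:
            scores[category] = score
    # Sort by descending score, then by name to keep the order stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


labels = rank_categories("fish quota for each vessel and fish pollution")
```

Keeping the top few categories rather than a single winner is what makes
the assignment multi-label; the cut-off top_n is an arbitrary choice
here.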


Languages: Ready-trained for 22 languages, but trainable for many more:

  Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
  Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese,
  Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish.

Language families: Germanic, Romance, Slavic, Hellenic, Finno-Ugric,
Baltic and Semitic.


Creator: European Commission – Joint Research Centre (JRC)


JEX can be used fully automatically or as an interactive tool to support
professional librarians in their work.

JEX also has many potential uses in the field of Computational
Linguistics because it is highly multilingual and lends itself to
cross-lingual tasks:

 - Use for multilingual classification experiments, e.g. to test the
   impact of different document representations (n-grams, lemmas, POS,
   word-sense disambiguation, …) across different languages and
   language families;

 - Use as input to other text mining applications, e.g. to:

    - Detect document translations (Pouliquen et al. 2004);

    - Detect cross-lingual plagiarism (Potthast et al. 2010);

    - Link related documents across languages (Pouliquen et al. 2008);

    - Support the lexical choice in Machine Translation;

    - Rank sentences in topic-specific summarisation.
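
Several of the cross-lingual uses above rely on documents in different
languages being mapped onto the same set of categories. One simple way
to exploit this, sketched here under assumptions, is to represent each
document by its category scores and compare documents with cosine
similarity. The category identifiers, the scores and the function name
below are hypothetical; the sketch is not taken from JEX or from the
cited papers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {category: score}."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical category scores for three documents in three languages.
doc_en = {"c1": 0.8, "c2": 0.5, "c3": 0.1}
doc_fr = {"c1": 0.7, "c2": 0.6}
doc_de = {"c4": 0.9, "c5": 0.4}

# The English and French documents share categories, so they score higher
# than the English and German pair, which has no category in common.
sim_en_fr = cosine_similarity(doc_en, doc_fr)
sim_en_de = cosine_similarity(doc_en, doc_de)
```

Because the comparison only looks at category scores, no word of the
original texts has to be translated; whether such a score is good enough
to detect translations or related documents would have to be tested on
real data.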


On the JRC's website you will find more information on its multilingual
language technology activity, download links for the JRC EuroVoc Indexer
JEX, and a page pointing to further freely available multilingual
resources. For details on JEX and its performance, see the following
publication, which you may also want to cite:

Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc
Indexer JEX - A freely available multi-label categorisation tool.
Proceedings of the 8th International Conference on Language Resources
and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.

Ralf Steinberger, Mohamed Ebrahim & Marco Turchi
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy

Message distributed by the Langage Naturel (LN) mailing list.
The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues).