Q: Classification performance across languages and language families

Wed Jun 6 07:37:28 UTC 2012

[th - replies were sent to the corpora list]

Date: Sat, 02 Jun 2012 13:40:07 +0200
From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
Message-id: <025501cd40b4$76198f90$624caeb0$@jrc.ec.europa.eu>
X-url: http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf
X-url: http://eurovoc.europa.eu/
X-url: http://langtech.jrc.ec.europa.eu/Eurovoc.html

A question and an invitation to discussion.

We recently carried out multi-label categorisation experiments
<http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf> on
a mostly parallel set of documents in 22 languages, covering the
language families Germanic, Romance, Slavic, Hellenic, Finno-Ugric,
Baltic and Semitic. The document set is reasonably large (22K to 42K
documents per language), and the experiments used the thousands of
subject domain categories from the EuroVoc thesaurus
<http://eurovoc.europa.eu/>. The
performance across languages was rather uniform, with the exception of
the outlier Maltese, which performed considerably less well. The
languages covered are Bulgarian, Czech, Danish, Dutch, English,
Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian,
Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian,
Spanish and Swedish.

To my great surprise, the highly inflected agglutinative language
Hungarian produced the best results of all. The five Germanic languages
ended up in the top ten positions, the five Slavic languages in the
bottom half. The results for the other language families were less
consistent.

Q1: Does anyone have an intuition as to how these results could be
explained?

Q2: Has anyone run similar experiments with other types of classifiers
or data? Are the results similar?

My initial expectation had been that highly inflected languages would
perform less well and that feature space reduction using lemmatisation
would improve the results. However, our experiments for Czech, English,
Estonian and French (described in Ebrahim et al., forthcoming) showed
the contrary, rather consistently for all four languages and language
families: (1) lemmatisation reduces the performance and (2) adding
part-of-speech (POS) information to the word form and/or to the lemma
improves the performance.
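
To make the four feature variants concrete, here is a minimal sketch of
how they differ in vocabulary size. It is not the JEX implementation:
the Token type, the features and vocabulary_sizes functions, the tag
names and the hand-tagged toy sample are all hypothetical and serve only
to illustrate the idea.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str   # surface word form as it appears in the text
    lemma: str  # dictionary form of the word
    pos: str    # part-of-speech tag (tag names are illustrative)


def features(tokens: List[Token], unit: str = "form",
             with_pos: bool = False) -> List[str]:
    """Turn tagged tokens into feature strings.

    unit     -- "form" for word forms, "lemma" for lemmas
    with_pos -- if True, append the POS tag to each feature
    """
    if unit not in ("form", "lemma"):
        raise ValueError("unit must be 'form' or 'lemma'")
    out = []
    for tok in tokens:
        base = (tok.form if unit == "form" else tok.lemma).lower()
        out.append(f"{base}_{tok.pos}" if with_pos else base)
    return out


def vocabulary_sizes(tokens: List[Token]) -> Dict[str, int]:
    """Count distinct features under each of the four representations."""
    return {
        "form": len(set(features(tokens, "form"))),
        "lemma": len(set(features(tokens, "lemma"))),
        "form+pos": len(set(features(tokens, "form", with_pos=True))),
        "lemma+pos": len(set(features(tokens, "lemma", with_pos=True))),
    }


# Hand-tagged toy data (an assumption, not taken from the experiments).
SAMPLE = [
    Token("The", "the", "DET"),
    Token("committee", "committee", "NOUN"),
    Token("reports", "report", "VERB"),
    Token("on", "on", "ADP"),
    Token("two", "two", "NUM"),
    Token("reports", "report", "NOUN"),
    Token("and", "and", "CCONJ"),
    Token("reported", "report", "VERB"),
    Token("a", "a", "DET"),
    Token("report", "report", "NOUN"),
]

if __name__ == "__main__":
    for name, size in vocabulary_sizes(SAMPLE).items():
        print(f"{name:10s} {size}")
```

On this toy sample, lemmatisation gives the smallest vocabulary and
adding POS tags to the word forms gives the largest, since the tag
separates the verb "reports" from the noun "reports". Whether a larger
vocabulary actually helps a classifier is the open question raised
below; the sketch only shows the direction in which each choice moves
the feature space.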

Q3: Can we conclude that the sparser the feature space, the better the
classification performance?

Q4: If that were the case, why did Slavic languages (and Maltese)
perform less well in our experiments?

I would be pleased if you could share your own experience and/or your
opinions.

The classification tool (JRC EuroVoc Indexer, JEX) and the multilingual
document set can both be downloaded from
http://langtech.jrc.ec.europa.eu/Eurovoc.html . Details of our
experiments are given in the two papers below.

Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc
Indexer JEX - A freely available multi-label categorisation
tool. Proceedings of the 8th international conference on Language
Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012. (PDF
http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf )

Ebrahim Mohamed, Maud Ehrmann, Marco Turchi & Ralf Steinberger
(forthcoming). Multi-label EuroVoc classification for Eastern and
Southern EU Languages. In: Cristina Vertan & Walther v. Hahn:
Multilingual processing in Eastern and Southern EU languages -
Low-resourced technologies and translation. Cambridge Scholars
Publishing, Cambridge, UK.

Greetings,

Ralf

Ralf Steinberger 
European Commission – Joint Research Centre (JRC)
URL: http://langtech.jrc.ec.europa.eu/RS.html  

-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription: http://www.atala.org/article.php3?id_article=48
English version       : 
Archives                 : http://listserv.linguistlist.org/archives/ln.html
                                http://liste.cines.fr/info/ln

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership: http://www.atala.org/
-------------------------------------------------------------------------