[Corpora-List] Q: Classification performance across languages and language families

Sat Jun 2 11:40:07 UTC 2012

A question and an invitation to discussion.

We recently carried out multi-label categorisation experiments <http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf> on a mostly parallel set of documents in 22 languages, covering the language families Germanic, Romance, Slavic, Hellenic, Finno-Ugric, Baltic and Semitic. The document set is reasonably large (22K to 42K documents per language), and the category set consists of the thousands of subject domain categories from the EuroVoc thesaurus <http://eurovoc.europa.eu/>. The performance across languages was rather uniform, with the exception of the outlier Maltese, which performed considerably less well. The languages covered are Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

To my great surprise, the highly inflected agglutinative language Hungarian produced the best results of all. The five Germanic languages ended up in the top ten positions, the five Slavic languages in the bottom half. The results for the other language families were less consistent. 

Q1: Does anyone have an intuition as to how these results could be explained?

Q2: Has anyone run similar experiments with other types of classifiers or data? Are the results similar?

My initial expectation had been that highly inflected languages would perform less well and that feature space reduction using lemmatisation would improve the results. However, our experiments for Czech, English, Estonian and French (described in Ebrahim et al., forthcoming) showed the contrary, rather consistently for all four languages and language families: (1) lemmatisation reduces the performance and (2) adding part-of-speech (POS) information to the word form and/or to the lemma improves the performance. 
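
To make the four feature representations compared above concrete (word form, lemma, word form plus POS, lemma plus POS), here is a minimal Python sketch. It is not the JEX implementation: the toy tokens, the function names and the "base_POS" feature format are all invented for illustration. It only shows how lemmatisation shrinks the feature vocabulary while adding POS information enlarges it.

```python
from collections import Counter

# Hypothetical toy input: (word form, lemma, POS tag) triples.
# Invented for illustration; not taken from the JEX document set.
TOKENS = [
    ("houses", "house", "NOUN"),
    ("house", "house", "NOUN"),
    ("housed", "house", "VERB"),
    ("house", "house", "VERB"),
    ("taxes", "tax", "NOUN"),
    ("taxed", "tax", "VERB"),
]


def features(tokens, use_lemma=False, add_pos=False):
    """Build a bag of features for one of the four representations."""
    bag = Counter()
    for form, lemma, pos in tokens:
        base = lemma if use_lemma else form
        # Assumed feature format: "base_POS" when POS is added.
        bag[f"{base}_{pos}" if add_pos else base] += 1
    return bag


def vocabulary_sizes(tokens):
    """Count the distinct features under each representation."""
    return {
        "form": len(features(tokens)),
        "lemma": len(features(tokens, use_lemma=True)),
        "form+POS": len(features(tokens, add_pos=True)),
        "lemma+POS": len(features(tokens, use_lemma=True, add_pos=True)),
    }


if __name__ == "__main__":
    for name, size in vocabulary_sizes(TOKENS).items():
        print(name, size)
```

On this toy input the lemma representation has the fewest distinct features and the form plus POS representation the most, which is the direction of the contrast discussed above; whether that contrast explains the classification results is exactly the open question.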

Q3: Can we conclude that the sparser the feature space, the better the classification performance?

Q4: If that were the case, why did Slavic languages (and Maltese) perform less well in our experiments? 

I would be pleased if you could share your own experience and/or your opinions.

The classification tool (JRC EuroVoc Indexer JEX) and the multilingual document set can be downloaded from http://langtech.jrc.ec.europa.eu/Eurovoc.html. Details of our experiments are given in the two papers below.

Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc Indexer JEX - A freely available multi-label categorisation tool. Proceedings of the 8th international conference on Language Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012. (PDF <http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf> )

Ebrahim Mohamed, Maud Ehrmann, Marco Turchi & Ralf Steinberger (forthcoming). Multi-label EuroVoc classification for Eastern and Southern EU Languages. In: Cristina Vertan & Walther v. Hahn: Multilingual processing in Eastern and Southern EU languages - Low-resourced technologies and translation. Cambridge Scholars Publishing, Cambridge, UK.

Greetings,

Ralf

Ralf Steinberger 

European Commission – Joint Research Centre (JRC)

URL: http://langtech.jrc.ec.europa.eu/RS.html  
