[Corpora-List] Q: Classification performance across languages and language families

Sat Jun 2 12:12:47 UTC 2012

Ralf,

Please excuse the scepticism, but what about the simple hypothesis that it all
depends on thesaurus quality? My hunch would be that it started from a
Germanic language, hence good performance there, that Slavic languages have
been added more recently, so there have been fewer years for
debugging/improving, and that there was a particularly inspired Hungarian
translator!

Maltese has a special problem - Maltese hasn't ever had a technical
vocabulary so there was nothing the Maltese thesaurus-translators could do
except make things up.

(Of course I'll be happy to have my hypothesis quashed by someone who knows
the history of Eurovoc)

Adam

On 2 June 2012 12:40, Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu> wrote:

> A question and an invitation to discussion.
>
> We recently carried out multi-label categorisation experiments
> <http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf> on a
> mostly parallel set of documents in 22 languages, covering the language
> families Germanic, Romance, Slavic, Hellenic, Finno-Ugric, Baltic and
> Semitic. The document set is reasonably large (22K to 42K documents per
> language), using the thousands of subject domain categories from the EuroVoc
> thesaurus <http://eurovoc.europa.eu/>. The performance across languages
> was rather uniform, with the exception of the outlier Maltese, which
> performed considerably less well. The languages covered are Bulgarian,
> Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek,
> Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
> Romanian, Slovak, Slovenian, Spanish and Swedish.
>
> To my great surprise, the highly inflected agglutinative language
> *Hungarian* produced the best results of all. The five Germanic languages
> ended up in the top ten positions, the five Slavic languages in the bottom
> half. The results for the other language families were less consistent.
>
> *Q1:* Does anyone have an intuition how these results could be explained?
>
> *Q2:* Has anyone run similar experiments with other types of classifiers
> or data? Are the results similar?
>
> My initial expectation had been that highly inflected languages would
> perform less well and that feature space reduction using lemmatisation
> would improve the results. However, our experiments for Czech, English,
> Estonian and French (described in Ebrahim et al., forthcoming) showed the
> contrary, rather consistently for all four languages and language families:
> (1) lemmatisation reduces the performance and (2) adding part-of-speech
> (POS) information to the word form and/or to the lemma improves the
> performance.
>
> *Q3:* Can we conclude that the sparser the feature space, the better the
> classification performance?
>
> *Q4:* If that were the case, why did Slavic languages (and Maltese)
> perform less well in our experiments?
>
> I would be pleased if you could share your own experience and/or your
> opinions.
>
> The classification tool (JRC EuroVoc Indexer JEX
> <http://langtech.jrc.ec.europa.eu/Eurovoc.html>) and the multilingual
> document set can be downloaded from
> http://langtech.jrc.ec.europa.eu/Eurovoc.html . Details of our
> experiments are given in the two papers below.
>
> Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). *JRC EuroVoc
> Indexer JEX - A freely available multi-label categorisation tool*.
> Proceedings of the 8th international conference on Language Resources and
> Evaluation (LREC'2012), Istanbul, 21-27 May 2012. (PDF
> <http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf>)
>
> Ebrahim Mohamed, Maud Ehrmann, Marco Turchi & Ralf Steinberger
> (forthcoming). *Multi-label EuroVoc classification for Eastern and
> Southern EU Languages*. In: Cristina Vertan & Walther v. Hahn:
> Multilingual processing in Eastern and Southern EU languages -
> Low-resourced technologies and translation. Cambridge Scholars Publishing,
> Cambridge, UK.
>
> Greetings,
>
> Ralf
>
> *Ralf Steinberger*
> European Commission – Joint Research Centre (JRC)
> URL: http://langtech.jrc.ec.europa.eu/RS.html
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>

-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================