[Corpora-List] Q: Classification performance across languages and language families

Sat Jun 2 13:25:46 UTC 2012

Dear Ralf

I find the issue you've raised quite interesting and I too wonder why
Maltese should behave so differently. Like Adam, I wondered about the quality
of the thesaurus at first. Perhaps that's not the reason, as you suggest.
But another reason -- also related to the relatively recent development of
vocabularies in certain technical areas in Maltese (Malta being bilingual,
most such technical areas were written about in English) -- might be
inconsistencies and/or variation in the way the documents in your set were
translated, which would also affect the distribution of lexical features
and the reliability with which they are associated with particular
categories. I am aware of an initiative in recent years among Maltese
translation bureaux to standardise some of the translations of technical
terms/phrases. (One of the problems seems to have been that, because
Maltese is Semitic, but has been heavily influenced by Romance, there is
often more than one possible translation for a given term. Another problem
is simply that translators, especially in the early days after Malta's
accession to the EU, would have relied on circumlocution and similar
"workarounds", before a vocabulary was gradually developed.) I guess the
more recent the document collection, the more likely it would be to avoid
such inconsistencies.
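
To illustrate the point about variation, here is a toy sketch in Python. The
tokens variant_a and variant_b are hypothetical placeholders (not real Maltese
terms), and plain document frequency within a category is only one assumed,
very simple measure of how strongly a lexical feature is tied to that
category. It is not the method used in JEX.

```python
from collections import Counter


def doc_frequency(docs):
    """Count, for each token, how many documents contain it."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # set(): count each token once per document
    return counts


# Four toy documents from one category, all mentioning the same concept,
# translated consistently with a single (hypothetical) term.
consistent = [
    ["variant_a", "x1"],
    ["variant_a", "x2"],
    ["variant_a", "x3"],
    ["variant_a", "x4"],
]

# The same four documents, but two translators chose different renderings.
varied = [
    ["variant_a", "x1"],
    ["variant_b", "x2"],
    ["variant_a", "x3"],
    ["variant_b", "x4"],
]

df_consistent = doc_frequency(consistent)
df_varied = doc_frequency(varied)

print(df_consistent["variant_a"])                      # 4
print(df_varied["variant_a"], df_varied["variant_b"])  # 2 2
```

With consistent translation, one feature occurs in every document of the
category; with two competing renderings, the same evidence is split between
two weaker features.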

I've also taken a look at your LREC paper, mainly at Table 1, where your
precision/recall and other stats are reported. Here too, there are some
things which I find surprising. For example, why are there only 6 elements
in your stop-word list for Maltese, compared to much bigger numbers for
many other languages?

albert

On 2 June 2012 14:29, Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu> wrote:

> Dear Adam,
>
> Thanks for your proposal and for allowing me to clarify: EuroVoc is a
> *classification scheme* with exactly the same 6700 subject domain classes
> in all languages, i.e. each class has a numerical identifier and exactly
> *one class label* that has been translated into all 27 or so languages.
> Example EuroVoc categories are ‘nuclear materials’, ‘Austria’, ‘fishery
> management’, ‘xenophobia’, ‘budget’, ‘population statistics’, ...
>
> I cannot see how such a classification scheme would favour one language
> over another, especially as the documents are parallel translations, as
> well: they have the same contents in all languages. EuroVoc is in no way
> comparable to a resource such as WordNet, which rather lists and organises
> existing words of a language, with varying coverage.
>
> Greetings from Italy to the UK.
>
> Ralf
>
> *From:* adam.kilgarriff at gmail.com [mailto:adam.kilgarriff at gmail.com] *On
> Behalf Of *Adam Kilgarriff
> *Sent:* 02 June 2012 14:13
> *To:* Ralf Steinberger
> *Cc:* corpora at uib.no; clef at dei.unipd.it; ln at cines.fr
> *Subject:* Re: [Corpora-List] Q: Classification performance across
> languages and language families
>
> Ralf,
>
> Please excuse scepticism, but what about the simple hypothesis that it all
> depends on thesaurus quality? My hunch would be that it started from a
> Germanic language, hence good performance there, and that Slavic languages
> have been added more recently, so there have been fewer years for
> debugging/improving, and that there was a particularly inspired Hungarian
> translator!
>
> Maltese has a special problem - Maltese hasn't ever had a technical
> vocabulary so there was nothing the Maltese thesaurus-translators could do
> except make things up.
>
> (Of course I'll be happy to have my hypothesis quashed by someone who
> knows the history of EuroVoc)
>
> Adam
>
> On 2 June 2012 12:40, Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
> wrote:
>
> A question and an invitation to discussion.
>
> We recently carried out multi-label categorisation experiments
> <http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf> on a
> mostly parallel set of documents in 22 languages, covering the
> language families Germanic, Romance, Slavic, Hellenic, Finno-Ugric, Baltic
> and Semitic. The document set is reasonably large (22K to 42K documents per
> language), using the thousands of subject domain categories from the EuroVoc
> thesaurus <http://eurovoc.europa.eu/>. The performance across languages
> was rather uniform, with the exception of the outlier Maltese, which
> performed considerably less well. The languages covered are Bulgarian,
> Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek,
> Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
> Romanian, Slovak, Slovenian, Spanish and Swedish.
>
> To my great surprise, the highly inflected agglutinative language
> *Hungarian* produced the best results of all. The five Germanic languages
> ended up in the top ten positions, the five Slavic languages in the bottom
> half. The results for the other language families were less consistent.
>
> *Q1:* Does anyone have an intuition how these results could be explained?
>
> *Q2:* Has anyone run similar experiments with other types of classifiers
> or data? Are the results similar?
>
> My initial expectation had been that highly inflected languages would
> perform less well and that feature space reduction using lemmatisation
> would improve the results. However, our experiments for Czech, English,
> Estonian and French (described in Ebrahim et al., forthcoming) showed the
> contrary, rather consistently for all four languages and language families:
> (1) lemmatisation reduces the performance and (2) adding part-of-speech
> (POS) information to the word form and/or to the lemma improves the
> performance.
>
> *Q3:* Can we conclude that: the scarcer the feature space, the better the
> classification performance?
>
> *Q4:* If that were the case, why did Slavic languages (and Maltese)
> perform less well in our experiments?
>
> I would be pleased if you could share your own experience and/or your
> opinions.
>
> The classification tool (JRC EuroVoc Indexer JEX
> <http://langtech.jrc.ec.europa.eu/Eurovoc.html>)
> and the multilingual document set can be downloaded from
> http://langtech.jrc.ec.europa.eu/Eurovoc.html . Details of our
> experiments are given in the two papers below.
>
> Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). *JRC EuroVoc
> Indexer JEX - A freely available multi-label categorisation tool*.
> Proceedings of the 8th international conference on Language Resources and
> Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
> (PDF: <http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf>)
>
> Ebrahim Mohamed, Maud Ehrmann, Marco Turchi & Ralf Steinberger
> (forthcoming). *Multi-label EuroVoc classification for Eastern and
> Southern EU Languages*. In: Cristina Vertan & Walther v. Hahn:
> Multilingual processing in Eastern and Southern EU languages -
> Low-resourced technologies and translation. Cambridge Scholars Publishing,
> Cambridge, UK.
>
> Greetings,
>
> Ralf
>
> *Ralf Steinberger*
>
> European Commission – Joint Research Centre (JRC)
>
> URL: http://langtech.jrc.ec.europa.eu/RS.html
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
> --
> ========================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at lexmasterclass.com
> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
>
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
> *DANTE: a lexical database for English* <http://www.webdante.com>
> ========================================
>

-- 
-----------------------------------------------------------------
Albert Gatt
Institute of Linguistics
Centre for Communication Technology Rm 402B
University of Malta
Tal-Qroqq Msida MSD2080
Malta

tel: (+356) 2340 2150
http://staff.um.edu.mt/albert.gatt/