[Corpora-List] Extension of USAS Semantic Tagger Framework for Italian, Chinese and Dutch

Rayson, Paul p.rayson at lancaster.ac.uk
Thu Nov 14 12:26:32 UTC 2013


Hi all,

I'm pleased to announce the beta release of the USAS Semantic Tagger lexicons for Italian, Chinese and Dutch: http://ucrel.lancs.ac.uk/usas/

The USAS framework is now being extended to cover three more languages: Italian, Chinese and Dutch. The Java software framework has been modified to accommodate these languages, and semantic lexicons have been compiled for them by automatically "translating" the English semantic lexicon entries, with some manual improvement where possible. Because translations and part-of-speech correspondences between languages are inevitably ambiguous, the automatically translated lexicons contain errors, which need to be corrected manually. A web interface is provided for users to test the semantic taggers. This is a beta release of the tools, which will be improved in future.
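
As a rough illustration of the automatic "translation" step described above, the following Python sketch projects semantic tags from English lexicon entries onto target-language words via a bilingual dictionary. The entry layout, the toy dictionary, the tag values and the function name project_lexicon are all assumptions made for illustration; they are not the actual USAS data format or code.

```python
from collections import defaultdict

# Toy English semantic lexicon: (word, POS) -> candidate semantic tags.
# Tag values are illustrative placeholders, not checked against the USAS lexicon.
english_lexicon = {
    ("dog", "noun"): ["L2"],
    ("bank", "noun"): ["I1", "W3"],
}

# Toy bilingual dictionary: English word -> Italian translations (hypothetical data).
en_to_it = {
    "dog": ["cane"],
    "bank": ["banca", "riva"],
}


def project_lexicon(lexicon, bilingual):
    """Copy each English entry's tags onto every translation of its headword."""
    projected = defaultdict(list)
    for (word, pos), tags in lexicon.items():
        for translation in bilingual.get(word, []):
            for tag in tags:
                # Avoid duplicate tags when several English words share a translation.
                if tag not in projected[(translation, pos)]:
                    projected[(translation, pos)].append(tag)
    return dict(projected)


italian_lexicon = project_lexicon(english_lexicon, en_to_it)
for entry, tags in sorted(italian_lexicon.items()):
    print(entry, tags)
```

In this toy example both "banca" and "riva" inherit every sense of "bank", including senses that do not apply to them. This is the kind of error that the manual checking mentioned above is meant to remove.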

Italian Semantic Tagger (http://phlox.lancs.ac.uk:8080/ucrel/semtagger/italian)

The Italian semantic tagger is being developed in collaboration with Dr Francesca Bianchi (Dip. di Studi Umanistici, Università del Salento, Italy) and Prof. Elena Semino (Dept. of Linguistics and English Language, Lancaster University, UK). The original Java software framework has been modified by incorporating the TreeTagger Italian POS tagger. The English semantic lexicon entries have been automatically translated into their Italian counterparts using FreeLang and other English-Italian dictionaries, with the help of Italian native speakers. Although some lexicon entries were manually checked, most were automatically generated and therefore inevitably contain errors, which need to be corrected manually in future. Currently, there are two Italian semantic lexicons: a single-word lexicon (over 20,400 entries) and a multi-word lexicon (over 4,100 entries which were manually checked).

Chinese Semantic Tagger (http://phlox.lancs.ac.uk:8080/ucrel/semtagger/chinese)

The Chinese semantic tagger has been developed by incorporating the Stanford Chinese word segmenter and the Chinese POS tagger into the USAS Java framework. The Chinese semantic lexicons have been automatically generated by translating the English semantic lexicon entries using a Chinese-English dictionary (Xiao et al., 2010) and an LDC (Linguistic Data Consortium) English-Chinese wordlist. Because the Stanford Chinese POS tagger and Xiao et al.'s dictionary use different POS tagsets, their tags are mapped into a simplified common tagset that is used internally by the software system. The Chinese lexicon also employs a set of extended kinship semantic tags designed by Qian and Piao (2009). We are grateful for the assistance of Dr Richard Xiao (Lancaster University, UK) and Qian Yufang (Zhejiang University of Media and Communications, China) with this research. Currently the Chinese single-word and multi-word unit semantic lexicons contain over 64,000 and over 19,000 entries respectively.
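
The tagset mapping described above could look roughly like the following Python sketch, in which tagger output and dictionary entries are both normalised to one coarse tagset before lexicon lookup. The Penn Chinese Treebank tags shown are ones the Stanford tagger produces, but the dictionary tags, the common tag names, the sample entry and the helper names to_common and lookup are assumptions for illustration only; the real mapping tables are not given in this announcement.

```python
# Penn Chinese Treebank tags (tagger side) -> simplified common tagset.
# The common tag names ("noun", "verb", ...) are an assumption.
CTB_TO_COMMON = {
    "NN": "noun", "NR": "noun",
    "VV": "verb",
    "VA": "adj", "JJ": "adj",
    "AD": "adv",
}

# Hypothetical dictionary-side tags; the dictionary's actual tagset is not given here.
DICT_TO_COMMON = {"n": "noun", "v": "verb", "a": "adj", "d": "adv"}


def to_common(tag, mapping, default="unknown"):
    """Map a source-specific POS tag to the common tagset."""
    return mapping.get(tag, default)


# Lexicon entries arrive with dictionary POS tags and are stored under common tags.
raw_entries = [("狗", "n", ["L2"])]  # illustrative entry, not taken from the real lexicon
lexicon = {
    (word, to_common(pos, DICT_TO_COMMON)): tags
    for word, pos, tags in raw_entries
}


def lookup(token, tagger_pos, lexicon):
    """Look up a tagged token; fall back to Z99 (assumed here as the 'unmatched' tag)."""
    key = (token, to_common(tagger_pos, CTB_TO_COMMON))
    return lexicon.get(key, ["Z99"])


print(lookup("狗", "NN", lexicon))
print(lookup("狗", "VV", lexicon))
```

Normalising both sides to one coarse tagset means neither the tagger nor the dictionary has to be converted into the other's scheme, at the cost of losing finer POS distinctions.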

Dutch Semantic Tagger (http://phlox.lancs.ac.uk:8080/ucrel/semtagger/dutch)

The Dutch semantic tagger has been developed using a similar process to that of the Italian semantic tagger, using the Dutch version of TreeTagger. The Dutch lexicon has been compiled by translating the English semantic lexicon entries using a Dutch-English dictionary developed by Tiberius and Schoonheim (2014). As the dictionary and TreeTagger use different POS tagsets, both are mapped into a simplified common tagset that is used by the software system. Currently the Dutch single-word semantic lexicon contains 4,203 entries. We are grateful for the assistance of Dr. Carole Tiberius (INL, Netherlands) with this research.

This research has been funded by the UCREL research centre at Lancaster University (http://ucrel.lancs.ac.uk/) and undertaken by Dr Scott Piao (http://www.research.lancs.ac.uk/portal/en/people/Scott-Piao/).

The beta versions of the lexicons can be downloaded from the website: http://ucrel.lancs.ac.uk/usas/ under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Alongside the beta release, we are calling for further assistance and collaboration to continue editing and checking the lexicons for these and other languages. If you are interested in getting involved, please contact me and Scott.

Regards,
Paul.

Dr. Paul Rayson
Director of UCREL and Senior Lecturer in Computer Science
Faculty of Science and Technology Director of International Teaching Partnerships
School of Computing and Communications, InfoLab21, Lancaster University, Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/~paul/
Tel: +44 1524 510357 Fax: +44 1524 510492
