[Corpora-List] Universal POS Tagset

Mon Feb 2 14:55:24 UTC 2009

Adam,

thanks for your interesting references. I've looked into the development
of tag sets for part-of-speech tagging for English, Urdu, Arabic and
Malay:

Atwell, E. 2008. Development of tag sets for part-of-speech tagging.
In: Anke Lüdeling & Merja Kytö (editors) Corpus Linguistics: An
International Handbook, Volume 1, pp. 501-526, Mouton de Gruyter.
(preprint: http://www.comp.leeds.ac.uk/eric/atwell08clih.pdf)
http://www.degruyter.de/cont/imp/mouton/detailEn.cfm?isbn=978-3-11-021142-9

Corpus linguists have not been able to agree on a single PoS tag set for
English, let alone a cross-language tag set. The problem is the wide
range of (sometimes conflicting) criteria used in the design of corpus
PoS tag sets: "... mnemonic tag names; underlying linguistic theory; classification
by form or function; analysis of idiosyncratic words; categorization 
problems; tokenisation issues: defining what counts as a word; 
multi-word lexical items; target user and/or application;
availability and/or adaptability of tagger software; adherence to
standards; variations in genre, register, or type of language; 
and degree of delicacy of the tag set."

Perhaps a small PoS tag set lacking "delicacy" or fine-grained
distinctions could apply across languages; e.g. the broad classes
used by traditional Arabic grammarians:
N (nouns), V (verbs), P (particles, i.e. others).
But arguably this is only useful to you if it reveals some syntactic
universals, and I guess dividing all words into just 3 classes
won't tell you much.
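
To make the idea concrete, here is a rough sketch (not taken from the
handbook chapter) of collapsing a fine-grained tag set into the three
broad classes. It assumes Penn Treebank tags as input; the function
names and the assignment of tags to N and V are illustrative
assumptions, and everything else falls into P ("others").

```python
# Illustrative sketch: collapse fine-grained tags into N / V / P.
# Assumption: input uses Penn Treebank tags; the grouping below is a
# hypothetical choice, not an established cross-language standard.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def coarsen(tag):
    """Map one fine-grained tag to a broad class: N, V, or P (others)."""
    if tag in NOUN_TAGS:
        return "N"
    if tag in VERB_TAGS:
        return "V"
    return "P"


def coarsen_sentence(tagged):
    """Apply coarsen() to a list of (word, tag) pairs."""
    return [(word, coarsen(tag)) for word, tag in tagged]


if __name__ == "__main__":
    example = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
               ("on", "IN"), ("the", "DT"), ("mat", "NN")]
    print(coarsen_sentence(example))
```

With only three classes, determiners, prepositions and everything else
that is not a noun or verb all end up as P, which shows how little
such a tag set distinguishes.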

Eric Atwell,  Leeds University

On Fri, 30 Jan 2009, Adam Teichert wrote:

> Hello all.
>
>
>  I've been looking for a POS tagset that is general enough to
> effectively tag "any" natural language.  (I'm looking at Linguistic
> Typology / Universal Implications so I want to compare POS taggings
> across many [possibly obscure] languages.) Does anyone know of such a
> tagset?
>
>  If anyone is interested in what I've found so far, this paper seems relevant:
>    "Induction of Fine-grained Part-of-speech Taggers via Classifier
> Combination and Crosslingual Projection" (Elliott Franco Drábek,
> David Yarowsky)
>    http://acl.ldc.upenn.edu/W/W05/W05-0807.pdf
>
>  Also, I'm aware of some efforts at Microsoft Research India, to
> perhaps develop a "universal" tagset for Indian Languages:
>    http://research.microsoft.com/en-us/groups/mls/default.aspx
>
>
>  Thanks for any ideas.
>
>  --Adam (R. Teichert)
>
>   MS Student
>   School of Computing
>   University of Utah
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Eric Atwell,
  Senior Lecturer, Language research group, School of Computing,
  Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
  TEL: 0113-3435430  FAX: 0113-3435468  WWW/email: google Eric Atwell