[Corpora-List] Universal POS Tagset

Christian Chiarcos christian.chiarcos at web.de
Wed Feb 4 13:06:10 UTC 2009


Hi Adam,

there are numerous approaches involving tag set translation, tag set
"interlinguas", or tag sets covering multiple languages, yet to my
knowledge, all of these are restricted to a limited set of languages or a
specific region.

The historically most important approach in this direction is probably  
represented by the EAGLES Recommendations for the Morphosyntactic  
Annotation of Corpora  
(http://www.ilc.cnr.it/EAGLES/annotate/annotate.html) that aim to provide  
a pan-European tag set, and the Multext-East standard extended these for  
Eastern European languages (http://nl.ijs.si/ME). However, it is highly  
questionable whether such standardization approaches can be extended far  
beyond a restricted region and a specific language family (cf. Khoja et  
al. 2001 on the development of an Arabic tagset independently from EAGLES).

So, I'm afraid to tell you that what you're looking for might not exist.

Still, natural candidates for cross-linguistically applicable sets of
annotation values for POS annotation are the Data Category Registry  
(http://www.isocat.org/) or the General Ontology of Linguistic Description  
(http://linguistics-ontology.org/).

Yet, these do not represent tag sets in a strict sense, but general  
inventories of annotation terminology. The main difference is that  
annotation values in a tag set are mutually exclusive, whereas different  
levels of description influencing the design of a POS tag set (syntax,
semantics, morphology, lexical ambiguity, ...) may overlap. So, attributive
possessive pronouns ("*her* child") are pronouns on semantic and
morphological grounds, but syntactically determiners. The ordinal number in
"I'm the first." is semantically a number, but syntactically a nominal
(head of an NP), etc. A terminological repository may allow for such
conceptual overlap, but a tag set needs to resolve these conflicts to
justify the assignment of a specific tag. Different tag sets adopt
different strategies and preferences to cope with such overlaps, e.g., to
tag only cardinal numbers as numerals and ordinal numbers as
adjectives, or to use the tag for determiner only if the determiner is not
a possessive pronoun, etc.
These were just two examples from English as a familiar and
well-understood language. Such conceptual mismatches multiply with the
number of languages involved, and which selection strategy lies behind a
specific tag is basically arbitrary.
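
To make the difference concrete, here is a minimal Python sketch. All
concept names, tag names and priority orders in it are invented for
illustration; they are not taken from the Data Category Registry, GOLD,
or any existing tag set. The repository side may attach several
overlapping concepts to one token, whereas each (hypothetical) tag set
has to pick exactly one tag, here by a simple priority order:

```python
# Hypothetical illustration: concept and tag names are invented,
# not taken from the Data Category Registry, GOLD, or a real tag set.

# A terminology repository may describe one token with overlapping concepts.
REPOSITORY_ANALYSES = {
    "her (in 'her child')": {"Pronoun", "Possessive", "Determiner"},
    "first (in 'I'm the first.')": {"Numeral", "Ordinal", "Nominal"},
}

# Two hypothetical tag sets resolve the overlap with different priorities.
TAGSET_A_PRIORITY = [("Determiner", "DET"), ("Pronoun", "PRON"),
                     ("Nominal", "NOUN"), ("Numeral", "NUM")]
TAGSET_B_PRIORITY = [("Pronoun", "PRON"), ("Determiner", "DET"),
                     ("Numeral", "NUM"), ("Nominal", "NOUN")]


def assign_tag(concepts, priority):
    """Return the tag of the first concept in `priority` that applies."""
    for concept, tag in priority:
        if concept in concepts:
            return tag
    raise ValueError(f"no tag covers concepts {sorted(concepts)}")


for token, concepts in REPOSITORY_ANALYSES.items():
    tag_a = assign_tag(concepts, TAGSET_A_PRIORITY)
    tag_b = assign_tag(concepts, TAGSET_B_PRIORITY)
    print(f"{token}: tag set A -> {tag_a}, tag set B -> {tag_b}")
```

Both tag sets start from the same overlapping description, yet they end
up with different tags for the same token, and nothing in the tag itself
tells you which priority order produced it.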

For morphological categories, there is a similar problem: Reference  
terminologies may tell you that there are labels for morphological case  
such as prepositional or locative, but they don't really tell you whether  
or not these labels refer to identical or distinct cases in one language  
or the other. Considering Russian, the prepositional case is occasionally  
referred to as locative -- this is actually only partly correct, as there  
are non-locative uses of the prepositional case  
(http://en.wikipedia.org/wiki/Locative_case#Slavic_languages). So, if  
you're going to investigate the distribution of locative case marking  
throughout the world, you may find that some Slavic languages have a  
locative (in their tag set), but others don't (because it is referred to
as prepositional in these tag sets). What you're evaluating is, in the
end, just a design decision of some tag set designer.
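
A small Python sketch of this effect, under the assumption of two
invented case inventories (the tag set names and their contents are
hypothetical and do not describe any real resource). The outcome of the
"survey" depends entirely on whether the two labels are treated as
aliases, which is itself a decision of the person doing the counting:

```python
# Hypothetical case inventories; tag set names and contents are invented.
CASE_LABELS = {
    "slavic_tagset_1": {"nominative", "genitive", "dative",
                        "accusative", "instrumental", "locative"},
    "slavic_tagset_2": {"nominative", "genitive", "dative",
                        "accusative", "instrumental", "prepositional"},
}


def tagsets_with_label(inventories, label, aliases=None):
    """Return sorted names of tag sets that contain `label` or an alias."""
    accepted = {label} | set((aliases or {}).get(label, ()))
    return sorted(name for name, labels in inventories.items()
                  if labels & accepted)


# Taking the labels at face value: only one tag set "has a locative".
naive = tagsets_with_label(CASE_LABELS, "locative")

# Treating "prepositional" as an alias: both tag sets "have a locative".
merged = tagsets_with_label(
    CASE_LABELS, "locative", aliases={"locative": {"prepositional"}})

print(naive)
print(merged)
```

Neither result is more "correct" than the other: the first one measures
label choice, and the second one silently ignores the non-locative uses
of the prepositional case.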

As for a more extreme example, consider the existence of a "verbal  
participle" in Inuktitut  
(http://www2.tu-berlin.de/fak1/el/board.cgi?id=angli&action=download&gul=124).  
Sounds like a participle as we know it, and it would probably be tagged as
such in a language-specific tagset, because it is an established term.
However, as opposed to Indo-European participles, this is a finite verb
(only that, by chance, it is systematically translated by an English
progressive participle): the verbal participle is merely a specific mood
of the verb indicating the temporal parallelism of multiple events, with
normal verbal inflection. So, in the end, what specific conclusions can
you draw from the existence of a tag "participle" there?

To make a long story short, there is no universal POS tag set, and the  
right questions would have to be "Can there be a universal POS tagset at
all?" and "If applying it to my data, how much noise am I willing to
tolerate?"

As you may guess, I have substantial doubts, not only because of the  
limited expressivity of tag sets (basically 1:1 matches: one tag = one  
language-specific category = one universal category = one phenomenon?),
but also because of the multitude of terminological traditions and  
linguistic disciplines involved (ranging from typology to NLP). Actually,  
a closer comparison between EAGLES (with a primary focus on NLP) and GOLD  
(with a primary focus on typology and language documentation) reveals  
quite a number of systematic mismatches in the conceptualization, e.g., in  
the subcategorization of nouns or pronouns/determiners/quantifiers. So, it
seems that in its current state neither of these is to be regarded as a
terminological reference for cross-linguistic, cross-discipline linguistic
annotation. However, GOLD is intended to be a community project, and so is
the Data Category Registry, and possibly these efforts will converge one
day into a general repository of annotation terminology usable by all
linguists working with linguistic annotations.

But even then, they will not represent a tag set in a proper sense, for  
the reasons given above. The question then remains how to bridge the gap  
between such a general repository of annotation terminology (potentially  
overlapping, general concepts) and concrete tag sets (mutually disjoint,  
language- or tagset-specific tags). I do have a suggestion for this, but  
this certainly belongs to an independent thread ...

Best,

Christian
-- 
Christian Chiarcos
Universität Potsdam
Collaborative Research Center 632, Project D1 "Linguistic Data Base for  
Information Structure"
Co-project "Sustainability of linguistic data"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925
