[Corpora-List] Universal POS Tagset
Christian Chiarcos
christian.chiarcos at web.de
Wed Feb 4 13:06:10 UTC 2009
Hi Adam,
there are numerous approaches involving tag set
translation, tag set "interlinguas" or tag sets covering multiple
languages, yet to my knowledge, all of these are restricted to a limited
set of languages or a specific region.
The historically most important approach in this direction is probably
represented by the EAGLES Recommendations for the Morphosyntactic
Annotation of Corpora
(http://www.ilc.cnr.it/EAGLES/annotate/annotate.html) that aim to provide
a pan-European tag set, and the Multext-East standard extended these for
Eastern European languages (http://nl.ijs.si/ME). However, it is highly
questionable whether such standardization approaches can be extended far
beyond a restricted region and a specific language family (cf. Khoja et
al. 2001 on the development of an Arabic tagset independently from EAGLES).
So, I'm afraid to tell you that what you're looking for might not exist.
Still, natural candidates for cross-linguistically applicable sets of
annotation values for POS annotation are the Data Category Registry
(http://www.isocat.org/) or the General Ontology of Linguistic Description
(http://linguistics-ontology.org/).
Yet, these do not represent tag sets in a strict sense, but general
inventories of annotation terminology. The main difference is that
annotation values in a tag set are mutually exclusive, whereas different
levels of descriptions influencing the design of a POS tag set (syntax,
semantics, morphology, lexical ambiguity, ...) may overlap. So, attributive
possessive pronouns ("*her* child") are pronouns on semantic and
morphologic grounds, but syntactically determiners. The ordinal number in
"I'm the first." is semantically a number, but syntactically a nominal
(head of an NP), etc. A terminological repository may allow for such
conceptual overlap, but a tag set needs to resolve these conflicts to
justify the assignment of a specific tag. Tag sets adopt different
strategies and preferences to cope with such conflicts, e.g., to
tag only cardinal numbers as numerals and ordinal numbers as
adjectives, or to use the tag for determiner only if the determiner is not
a possessive pronoun, etc.
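To make the contrast concrete, here is a minimal Python sketch. Everything
in it is an illustrative assumption (the CONCEPTS inventory, the level
names, and the make_tagset helper are hypothetical, not taken from any
existing tag set or registry): a repository keeps all overlapping
categories, whereas a tag set must pick exactly one per token according to
some priority.

```python
# Hypothetical concept inventory: each level of description contributes
# its own category, and the categories are allowed to overlap.
CONCEPTS = {
    "her": {"semantics": "pronoun", "morphology": "pronoun", "syntax": "determiner"},
    "first": {"semantics": "numeral", "syntax": "nominal"},
}


def make_tagset(priority):
    """Build a tagger that resolves overlap by a fixed priority of levels.

    `priority` is a list of level names; the first level for which the
    token has a category decides the single tag that is assigned.
    """
    def tag(token):
        levels = CONCEPTS[token]
        for level in priority:
            if level in levels:
                return levels[level]
        raise KeyError(f"no category for {token!r} at any level in {priority}")
    return tag


# Two equally defensible (and equally arbitrary) design decisions.
syntax_first = make_tagset(["syntax", "morphology", "semantics"])
semantics_first = make_tagset(["semantics", "morphology", "syntax"])

for token in ("her", "first"):
    print(token, syntax_first(token), semantics_first(token))
```

The same token ends up with different tags under the two priorities,
although the underlying description in CONCEPTS is identical.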
These were just two examples from English as a familiar and
well-understood language. Such conceptual mismatches substantially
increase with the number of languages involved, and which exact
selection strategy lies behind a specific tag is basically arbitrary.
For morphological categories, there is a similar problem: Reference
terminologies may tell you that there are labels for morphological case
such as prepositional or locative, but they don't really tell you whether
or not these labels refer to identical or distinct cases in one language
or the other. Considering Russian, the prepositional case is occasionally
referred to as locative -- this is actually only partly correct, as there
are non-locative uses of the prepositional case
(http://en.wikipedia.org/wiki/Locative_case#Slavic_languages). So, if
you're going to investigate the distribution of locative case marking
throughout the world, you may find that some Slavic languages have a
locative (in their tag set), but others don't (because it is referred to
as prepositional in these tag sets), but what you're evaluating is in the
end just a design decision of some tag set designer.
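A small Python sketch of this pitfall follows. The two case inventories
("tagset_A", "tagset_B") and the EQUIVALENT table are hypothetical and
only serve as illustration; they do not reproduce any real tag set.

```python
# Hypothetical case inventories of two tag sets for related languages.
# The only difference is the label chosen for one case.
TAGSETS = {
    "tagset_A": {"nominative", "genitive", "dative", "accusative",
                 "instrumental", "locative"},
    "tagset_B": {"nominative", "genitive", "dative", "accusative",
                 "instrumental", "prepositional"},
}


def tagsets_with_label(label):
    """Naive survey: which tag sets contain this exact label?"""
    return sorted(name for name, cases in TAGSETS.items() if label in cases)


# Hypothetical harmonisation table. Treating the two labels as equivalent
# is itself a judgment call, since the prepositional case also has
# non-locative uses.
EQUIVALENT = {"prepositional": "locative"}


def tagsets_with_label_normalised(label):
    """Survey after mapping labels through the harmonisation table."""
    return sorted(
        name for name, cases in TAGSETS.items()
        if label in {EQUIVALENT.get(case, case) for case in cases}
    )


print(tagsets_with_label("locative"))
print(tagsets_with_label_normalised("locative"))
```

The naive survey finds a "locative" in only one of the two tag sets; the
normalised one finds it in both. Neither result says anything about the
languages themselves, only about labelling decisions.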
As for a more extreme example, consider the existence of a "verbal
participle" in Inuktitut
(http://www2.tu-berlin.de/fak1/el/board.cgi?id=angli&action=download&gul=124).
It sounds like a participle as we know it, and it would probably be tagged
as such in a language-specific tagset, because it is an established term.
However, as opposed to Indo-European participles, this is a finite verb
(only that, by chance, it is systematically translated by an English
progressive participle): the verbal participle is merely a specific mood
of the verb indicating the temporal parallelism of multiple events, with
normal verbal inflection. So, in the end, what specific conclusions can
you draw from the existence of a tag "participle" there?
To make a long story short, there is no universal POS tag set, and the
right questions would have to be "Can there be a universal POS tagset at
all?" and "If applying it to my data, how much noise am I willing to
tolerate?"
As you may guess, I have substantial doubts, not only because of the
limited expressivity of tag sets (basically 1:1 matches: one tag = one
language-specific category = one universal category = one phenomenon?),
but also because of the multitude of terminological traditions and
linguistic disciplines involved (ranging from typology to NLP). Actually,
a closer comparison between EAGLES (with a primary focus on NLP) and GOLD
(with a primary focus on typology and language documentation) reveals
quite a number of systematic mismatches in the conceptualization, e.g., in
the subcategorization of nouns or pronouns/determiners/quantifiers. So, it
seems that in its current state neither of these can be regarded as a
terminological reference for cross-linguistic, cross-discipline linguistic
annotation. However, GOLD is intended to be a community project, and so is
the Data Category Registry, and possibly these efforts will converge one
day into a general repository of annotation terminology usable by all
linguists working with linguistic annotations.
But even then, they will not represent a tag set in a proper sense, for
the reasons given above. The question then remains how to bridge the gap
between such a general repository of annotation terminology (potentially
overlapping, general concepts) and concrete tag sets (mutually disjoint,
language- or tagset-specific tags). I do have a suggestion for this, but
this certainly belongs to an independent thread ...
Best,
Christian
--
Christian Chiarcos
Universität Potsdam
Collaborative Research Center 632, Project D1 "Linguistic Data Base for
Information Structure"
Co-project "Sustainability of linguistic data"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora