[Corpora-List] Universal POS Tagset

Mon Feb 2 15:35:40 UTC 2009

> I've been looking for a POS tagset that is general enough to
> effectively tag "any" natural language.  (I'm looking at Linguistic
> Typology / Universal Implications so I want to compare POS taggings
> across many [possibly obscure] languages.) Does anyone know of such a
> tagset?

One of the issues is going to be at what level of detail one wants the
tags.  If it's just the standard parts of speech (noun, verb,
pre-/post-position...), it might not be hard to come up with a list,
although there would be problems in particular languages (is the 'for' of
English for-to clauses a preposition or a complementizer, and is there
really a difference?).

If, on the other hand, you want to tag things like person, number, etc.,
which plenty of taggers have done, then there is a very long list of
features and feature values which one might tag.  There are, for example,
languages which, in addition to the usual singular/plural distinction in
the number feature, distinguish dual, trial, paucal, etc.; and languages
which have far different gender classes than are dreamed of in most
categorizations.  And there are languages which morphologically mark verbs
for such things as agreement with ergative and absolutive arguments, and
evidential status (seen, inferred, reported, etc.).
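
To make the "long list of features and feature values" concrete, here is a
minimal sketch of a tag as a coarse part of speech plus an open-ended
bundle of feature/value pairs, so that a language with dual or paucal
number, or with evidential marking, does not need a new tag per
combination.  The names Tag, make_tag and render, and the feature and
value labels, are my own illustrations, not drawn from any standard
tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A coarse POS plus an open-ended set of feature/value pairs."""
    pos: str
    features: tuple = ()  # sorted (feature, value) pairs, so equal tags compare equal


def make_tag(pos, **features):
    # Sorting makes the tag independent of the order the features were given in.
    return Tag(pos, tuple(sorted(features.items())))


def render(tag):
    """Serialize a tag as POS|feature=value|... (an ad hoc format, assumed here)."""
    if not tag.features:
        return tag.pos
    return tag.pos + "|" + "|".join(f"{name}={value}" for name, value in tag.features)


# Illustrative values only: a dual noun and a verb marked for inferred evidence.
dual_noun = make_tag("noun", number="dual")
inferred_verb = make_tag("verb", number="paucal", evidentiality="inferred")
```

Nothing in this sketch restricts which features or values may appear; that
inventory is exactly the part a cross-linguistic tagset has to settle.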

Yet another issue for standardized tag sets is that some morphosyntactic
feature values will cover a wider range in one language than they might in
another, or values will overlap in different ways in different languages. 
Case systems are notoriously like that.

I know of two efforts to come up with lists of tags (in addition to the
responses you've already gotten).  One is the ISO TC 37/SC4 effort for
lexicons, which uses a "Data Category Registry" to register tags for use
in electronic lexicons; see http://www.isocat.org.  The last time I
looked, this struck me as rather Euro-centric, meaning that it might not
be a good fit for "possibly obscure" languages.

The other effort is the GOLD ontology,
http://linguistics-ontology.org/gold.html.  This ontology has been
populated by people who know about a very large variety of languages (with
initial input from a list compiled by SIL).  It is not really intended as
a list of tags (or of tag components), although you could use it that way,
but rather it is intended as something that a tag list could be defined by
reference to.  For example, it is common in Nahuatl to refer to the
'absolutive' form of a noun.  This has nothing to do with the
ergative/absolutive distinction, but it is nevertheless a standard usage
among Nahuatl (maybe even Uto-Aztecan) linguists.  The idea behind GOLD is
that a Nahuatl linguist would continue to use the standard 'absolutive'
term/tag, but define it in terms of the categories in the GOLD ontology.
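
The "define by reference" idea can be sketched as a small registry in which
each language keeps its own tag names but points each one at ontology
concepts, so that two tags are compared by what they refer to rather than
by what they are called.  Everything named here is an assumption for
illustration: the concept identifiers are placeholders, not real GOLD
entries, and "ExampleErgativeLanguage" is a made-up stand-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagDefinition:
    language: str
    local_tag: str        # the term linguists of that language already use
    ontology_refs: tuple  # placeholder concept identifiers, NOT real GOLD names


REGISTRY = {}


def define(definition):
    # Key on (language, tag) so the same label can mean different things.
    REGISTRY[(definition.language, definition.local_tag)] = definition


def comparable(lang_a, tag_a, lang_b, tag_b):
    """True if the two local tags share at least one ontology concept."""
    refs_a = set(REGISTRY[(lang_a, tag_a)].ontology_refs)
    refs_b = set(REGISTRY[(lang_b, tag_b)].ontology_refs)
    return bool(refs_a & refs_b)


# Same label, different referents (identifiers are hypothetical).
define(TagDefinition("Nahuatl", "absolutive", ("PLACEHOLDER:NounFormCategory",)))
define(TagDefinition("ExampleErgativeLanguage", "absolutive",
                     ("PLACEHOLDER:AbsolutiveCase",)))
```

With these definitions the two 'absolutive' tags are not treated as
comparable, even though the labels are identical.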

   Mike Maxwell
   CASL/ U MD
