[Corpora-List] Universal POS Tagset

Damir C'avar dcavar at indiana.edu
Mon Feb 2 16:48:04 UTC 2009


maxwell at umiacs.umd.edu wrote:
>> I've been looking for a POS tagset that is general enough to
>> effectively tag "any" natural language.  (I'm looking at Linguistic
>> Typology / Universal Implications so I want to compare POS taggings
>> across many [possibly obscure] languages.) Does anyone know of such a
>> tagset?
>
>
>
> The other effort is the GOLD ontology,
> http://linguistics-ontology.org/gold.html.  This ontology has been
> populated by people who know about a very large variety of languages (with
> initial input from a list compiled by SIL).  It is not really intended as
> a list of tags (or of tag components), although you could use it that way,
> but rather it is intended as something that a tag list could be defined by
> reference to.  For example, it is common in Nahuatl to refer to the
> 'absolutive' form of a noun.  This has nothing to do with the ergative/
> absolutive distinction, but it is nevertheless a standard usage among
> Nahuatl (maybe even Uto-Aztecan) linguists.  The idea behind Gold is that
> a Nahuatl linguist would continue to use the standard 'absolutive' term/
> tag, but define it in terms of the categories in the Gold ontology.
>   

The GOLD ontology is still missing some concepts (features and
properties) for some (maybe many) languages, but the process for
extending it is somewhat defined. There is, e.g., a Google group where
issues can be discussed:

http://groups.google.hr/group/gold-ontology

Indeed, it would be a good idea to get more axioms and concepts into
GOLD, to extend its usability to a wider range of scenarios and
research questions. The comparisons you mention are exactly what we
would like to see, e.g. some sort of typology of languages via
individual instantiations of GOLD (for qualitative comparison, and
qualitative cross-dependencies between features), as well as via
annotated corpora for quantitative differences and similarities.
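
To make the qualitative side concrete, here is a minimal sketch in
Python, assuming each language is described simply as a set of category
labels. The language names and labels are made-up placeholders, not
actual GOLD identifiers and not claims about any real language, and
Jaccard overlap is just one possible measure, not something GOLD
prescribes.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of categories that two inventories have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0  # two empty inventories are treated as identical
    return len(a & b) / len(a | b)


# Made-up inventories; the labels are placeholders, not GOLD identifiers.
inventories = {
    "lang_a": {"Noun", "Verb", "Adjective", "Case"},
    "lang_b": {"Noun", "Verb", "Case", "Evidentiality"},
}

score = jaccard(inventories["lang_a"], inventories["lang_b"])
print(f"qualitative overlap: {score:.2f}")
```

Cross-dependencies between features would need more than set overlap,
of course; this only covers the plain inventory comparison.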

We used the GOLD ontology in our morphological parser for Croatian
(CroMo), and we looked to some extent at the possibility of mapping it
to other common tagsets. Our goal was exactly this: being able to run
qualitative and quantitative similarity measures across languages and
corpora via some general tagset (with mappings from other tagsets to
this one, so that we can use existing corpora).

Mapping of e.g. MULTEXT (EAST) is somewhat possible (maybe losing, here
and there, specific properties that GOLD would have but MULTEXT would
not, etc.).
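
As a rough illustration of the quantitative side, here is a minimal
sketch in Python, assuming a corpus is just a list of tags and the
mapping is a plain lookup table. TAG_MAP, the tag names, and the two toy
corpora are hypothetical; they do not reproduce real MULTEXT-East or
GOLD identifiers, and cosine similarity is only one possible measure.
Tags without a mapping are reported rather than silently dropped, since
that is where information gets lost.

```python
from collections import Counter
from math import sqrt

# Hypothetical mapping from a corpus-specific tagset to shared labels.
# Neither side reproduces real MULTEXT-East or GOLD identifiers.
TAG_MAP = {"N": "Noun", "V": "Verb", "A": "Adjective"}


def to_common(tags, tag_map):
    """Map tags to shared labels; return (mapped tags, set of unmapped tags)."""
    mapped, unmapped = [], set()
    for tag in tags:
        if tag in tag_map:
            mapped.append(tag_map[tag])
        else:
            unmapped.add(tag)  # keep track of what the mapping cannot express
    return mapped, unmapped


def cosine(c1, c2):
    """Cosine similarity between two tag-frequency Counters."""
    dot = sum(c1[k] * c2[k] for k in set(c1) | set(c2))
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy tag sequences standing in for two annotated corpora.
corpus_a = ["N", "V", "N", "X"]
corpus_b = ["N", "V", "A"]

mapped_a, unmapped_a = to_common(corpus_a, TAG_MAP)
mapped_b, unmapped_b = to_common(corpus_b, TAG_MAP)
sim = cosine(Counter(mapped_a), Counter(mapped_b))
print(f"similarity: {sim:.3f}, unmapped in A: {sorted(unmapped_a)}")
```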


ciao
DC

