[Corpora-List] Tag sets

Tue Feb 5 17:36:38 UTC 2008

We're looking at annotating a small sample (~5k words) of Bengali text,
and later maybe Urdu and Punjabi.  The annotation will be the dictionary
citation form of each word.  The texts are mostly news articles, so there
are a fair number of words for which there won't be any dictionary
citation form.  These include many proper names, numerals, acronyms, and
who knows what else.  I'll refer to these as "non-dictionary words",
whereas "dictionary words" will include words whose citation form is in
the dictionary we're using, even if the inflected wordform itself is not. 
(We're doing this to test a morphological parser.)

This is not quite the same as the inverse of named entity tagging, since
some parts of names may have citation forms.  For example, in English one
would tag "Mississippi River" as a name.  But "River" would be found in
the dictionary, so for our purposes we would only want to tag
"Mississippi" as a non-dictionary word.

The simplest thing for us to do would be to just tag all such
non-dictionary words the same way, e.g. with a tag "NOT".  However, in the
interest of future uses to which we might put such a tagged text, it might
be good to differentiate among the various kinds of non-dictionary words.
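
As a rough illustration of the token-level scheme described above, here is a minimal Python sketch. The toy dictionary, the `annotate` helper, and the English example are all hypothetical; the lookup only stands in for the annotator's decision and is not the actual annotation procedure.

```python
# Toy mapping from wordforms to dictionary citation forms.
# (Hypothetical data; a real project would consult its own dictionary.)
CITATION_FORMS = {
    "the": "the",
    "river": "river",
    "rivers": "river",
    "flows": "flow",
}

NON_DICTIONARY_TAG = "NOT"  # the single catch-all tag mentioned above


def annotate(tokens):
    """Pair each token with its citation form, or with the catch-all tag."""
    return [
        (tok, CITATION_FORMS.get(tok.lower(), NON_DICTIONARY_TAG))
        for tok in tokens
    ]


for token, annotation in annotate(["The", "Mississippi", "River", "flows"]):
    print(token, annotation)
```

Note that "River" receives a citation form even though it is part of a name, while "Mississippi" alone gets the non-dictionary tag.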

We could easily make up our own tagset for non-dictionary words, but it
strikes me that it would be better to use a standard tagset for such
words, if one exists.  There is a table of tagsets in Manning
and Schütze, pp. 141-2, including the Penn Treebank, Brown, and CLAWS.
However, the tagsets are English-specific.  This is especially noticeable
in the punctuation tags for the PTB and Brown sets, but also e.g. in the
decision to tag singular and plural proper nouns differently.  (Some
languages attach case markers to proper nouns.)  Also, it appears that
none of the tagsets distinguishes between numerals ('3', '4.5') and
numbers written out ('three', 'four point five'), which we need to do, nor
are acronyms distinguished from "symbols".
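
To make the finer distinctions concrete, the sketch below assigns a subtype to a token already judged to be a non-dictionary word. The tag names (`NOT-NUM`, `NOT-ACR`, `NOT-SYM`, `NOT-OTHER`) and the `subtype` function are invented for illustration and come from no standard tagset. The acronym pattern assumes Latin-script capitals, so it would not carry over to Bengali as written; numbers that are written out would presumably be dictionary words and never reach this function.

```python
import re

# Digit strings such as '3' or '4.5'. In Python 3, \d on str patterns is
# Unicode-aware, so Bengali digits are matched as well.
NUMERAL_RE = re.compile(r"\d+(?:[.,]\d+)*")

# Latin-script heuristic only: two or more capitals, optional periods.
ACRONYM_RE = re.compile(r"(?:[A-Z]\.?){2,}")


def subtype(token):
    """Return a hypothetical subtype tag for a non-dictionary token."""
    if NUMERAL_RE.fullmatch(token):
        return "NOT-NUM"
    if ACRONYM_RE.fullmatch(token):
        return "NOT-ACR"
    if token and not any(ch.isalnum() for ch in token):
        return "NOT-SYM"  # no letters or digits at all
    return "NOT-OTHER"    # e.g. proper names


for tok in ["3", "4.5", "UN", "%", "Mississippi"]:
    print(tok, subtype(tok))
```

Keeping numerals, acronyms, and symbols as separate subtypes means they can always be collapsed back into a single tag later, whereas the reverse would require re-annotation.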

Another distinction I thought about making is between "ordinary" Bengali
names and foreign names, since one might later want to develop a
transducer to convert the latter into their more common Latin-script forms.
However, I suspect that might be too difficult a distinction for
annotators to make, and in any case some well-known Bengali names are
likely to have "standard" transliterations.

Does anyone know of a semi-standard tagset that would be less
English-specific, and would make the kinds of distinctions among
non-dictionary words that we want to (or should) make?  Or should we just
make up our own set?

   Mike Maxwell
   CASL/ U MD
