[Corpora-List] Tag sets

Nicolas Torzec torzecn at yahoo-inc.com
Wed Feb 6 09:00:13 UTC 2008


Hi,
Having worked on a similar project a few years ago, I think the 
following references could be of interest for your project.


1) TEI: Text Encoding Initiative

The Text Encoding Initiative (TEI) is a consortium of institutions and 
research projects which collectively maintains and develops a standard 
for the representation of texts in digital form. Its major deliverable 
is a set of Guidelines, which specify encoding methods for 
machine-readable texts, chiefly in the humanities, social sciences and 
linguistics. The Guidelines define some 400 different textual components 
and concepts, which can be expressed using a markup language and defined 
by a DTD or XML schema.
=> http://www.tei-c.org/index.xml


2) NSW: Normalization of Non-Standard Words

@misc{ sproat-article,
       author = "Richard Sproat and Alan W Black and Stanley Chen and Shankar Kumar and Mari Ostendorf and Christopher Richards",
       title = "Article Submitted to Computer Speech and Language",
       url = "citeseer.ist.psu.edu/537653.html"
     }
=> http://www.clsp.jhu.edu/ws99/projects/normal/
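
To give a rough idea of how NSW-style categories could separate some of the non-dictionary words you mention (numerals vs. acronyms vs. everything else), here is a minimal sketch in Python. The tag names NUM and LSEQ are borrowed loosely from the NSW taxonomy, NOT is the catch-all tag from your message, and the function name and regular expressions are my own assumptions for illustration only. In particular, the acronym pattern relies on Latin capital letters and would not carry over to Bengali script as is.

```python
import re

# Hypothetical coarse tags, loosely modelled on the NSW taxonomy:
#   NUM  = number written in digits
#   LSEQ = letter sequence, e.g. an acronym
#   NOT  = catch-all for any other non-dictionary word
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")    # \d matches any Unicode decimal digit
LSEQ_RE = re.compile(r"(?:[A-Z]\.?){2,}")  # Latin-script acronyms only (assumption)


def tag_non_dictionary_word(token: str) -> str:
    """Return a coarse tag for a token already known to be absent from the dictionary."""
    if NUM_RE.fullmatch(token):
        return "NUM"
    if LSEQ_RE.fullmatch(token):
        return "LSEQ"
    return "NOT"


if __name__ == "__main__":
    for tok in ["4.5", "৩", "U.S.A.", "NATO", "Mississippi"]:
        print(tok, tag_non_dictionary_word(tok))
```

Because Python's \d covers Unicode decimal digits, Bengali numerals fall under NUM without extra work. Written-out numbers ('three') would need a word list and are deliberately left in the catch-all here.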


Hope this helps.
Nicolas

--
Nicolas Torzec
Yahoo! Inc.



maxwell at umiacs.umd.edu wrote:
> We're looking at annotating a small sample (~5k words) of Bengali text,
> and later maybe Urdu and Punjabi.  The annotation will be the dictionary
> citation form of each word.  The texts are mostly news articles, so there
> are a fair number of words for which there won't be any dictionary
> citation form.  These include many proper names, numerals, acronyms, and
> who knows what else.  I'll refer to these as "non-dictionary words",
> whereas "dictionary words" will include words whose citation form is in
> the dictionary we're using, even if the inflected wordform itself is not. 
> (We're doing this to test a morphological parser.)
>
> This is not quite the same as the inverse of named entity tagging, since
> some parts of names may have citation forms.  For example, in English one
> would tag "Mississippi River" as a name.  But "River" would be found in
> the dictionary, so for our purposes we would only want to tag
> "Mississippi" as a non-dictionary word.
>
> The simplest thing for us to do would be to just tag all such
> non-dictionary words the same way, e.g. with a tag "NOT".  However, in the
> interest of future uses to which we might put such a tagged text, it might
> be good to differentiate among the various kinds of non-dictionary words.
>
> We could easily make up our own tagset for non-dictionary words, but it
> strikes me that better would be to use some standard tagset for such
> words, if such a tagset exists.  There is a table of tagsets in Manning
> and Schutze pg. 141-2, including the Penn Treebank, Brown, and CLAWS. 
> However, the tagsets are English-specific.  This is especially noticeable
> in the punctuation tags for the PTB and Brown sets, but also e.g. in the
> decision to tag singular and plural proper nouns differently.  (Some
> languages attach case markers to proper nouns.)  Also, it appears that
> none of the tagsets distinguishes between numerals ('3', '4.5') and
> numbers written out ('three', 'four point five'), which we need to do, nor
> are acronyms distinguished from "symbols".
>
> Another distinction I thought about making is between "ordinary" Bengali
> names, and foreign names, since one might later want to develop a
> transducer to convert the latter into their more common Latin forms. 
> However, I suspect that might be too difficult a distinction for
> annotators to make, and in any case some well-known Bengali names are
> likely to have "standard" transliterations.
>
> Does anyone know of a semi-standard tagset that would be less
> English-specific, and would make the kinds of distinctions among
> non-dictionary words that we want to (or should) make?  Or should we just
> make up our own set?
>
>    Mike Maxwell
>    CASL/ U MD
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   




